The Era of Big Data: Why and How?
VAHID AMIRI
VAHIDAMIRY.IR
VAHID.AMIRY@GMAIL.COM
Big Data: Data Gathering → Data Storing → Data Processing
Big Data Definition
 No single standard definition…
“Big Data” is data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden
knowledge from it…
Big Data: 3V’s
Volume
 12+ TBs of tweet data every day
 25+ TBs of log data every day
 ? TBs of data every day
 2+ billion people on the Web by end 2011
 30 billion RFID tags today (1.3B in 2005)
 4.6 billion camera phones worldwide
 100s of millions of GPS-enabled devices sold annually
 76 million smart meters in 2009… 200M by 2014
Variety (Complexity)
 Relational Data (Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF), …
 Streaming Data
 You can only scan the data once
 Big Public Data (online, weather, finance, etc)
To extract knowledge, all these types of
data need to be linked together
A Single View to the Customer
Customer
Social Media
Gaming
Entertain
Banking
Finance
Our Known History
Purchase
Velocity (Speed)
 Data is being generated fast and needs to be processed fast
 Online Data Analytics
 Late decisions → missing opportunities
Social media and networks (all of us are generating data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
Some Make it 4V’s
 The Model of Generating/Consuming Data has Changed
The Model Has Changed…
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
Solution
Big Data + Big Computation → Big Computer
Big Data Solutions
 Hadoop is a software framework for distributed processing of large datasets
across large clusters of computers
 Hadoop implements Google’s MapReduce, using HDFS
 MapReduce divides applications into many small blocks of work.
 HDFS creates multiple replicas of data blocks for reliability, placing them on compute
nodes around the cluster
Hadoop
Spark Stack
 More than just the Elephant in the room
 Over 120 types of NoSQL databases
So many NoSQL options
 Extend the Scope of RDBMS
 Caching
 Master/Slave
 Table Partitioning
 Federated Tables
 Sharding
NoSql
 Relational database (RDBMS) technology
 Has not fundamentally changed in over 40 years
 Default choice for holding data behind many web apps
 Handling more users means adding a bigger server
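Sharding, the last of the RDBMS scaling techniques above, is easy to sketch: rows are routed to one of N database shards by hashing the primary key. The shard names below are hypothetical; this is a minimal illustration, not a production router (real systems use consistent hashing to survive resharding).

```python
# Minimal sketch of hash-based sharding: the same key always routes
# to the same shard, spreading rows across servers.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical names

def shard_for(key: str) -> str:
    """Pick a shard deterministically from the key's hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Routing is deterministic: repeated lookups hit the same shard.
assert shard_for("user:42") == shard_for("user:42")
```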
RDBMS with Extended Functionality
Vs.
Systems Built from Scratch
with Scalability in Mind
NoSQL Movement
CAP Theorem
 “Of three properties of shared-data systems – data Consistency, system
Availability and tolerance to network Partition – only two can be achieved at
any given moment in time.”
 CA
 Highly-available consistency
 CP
 Enforced consistency
 AP
 Eventual consistency
CAP Theorem
Flavors of NoSQL
 Schema-less
 State (Persistent or Volatile)
 Example:
 Redis
 Amazon DynamoDB
Key / Value Database
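A key/value store boils down to a map from opaque keys to values. The toy class below mimics the basic Redis string commands (SET/GET/DEL) with volatile in-memory state; it is an illustration of the model only, and a real Redis adds persistence, expiry, and richer data structures.

```python
# Toy in-memory key/value store, modeled loosely on Redis SET/GET/DEL.
class KVStore:
    def __init__(self):
        self._data = {}  # volatile state: lost on restart

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Return True if the key existed and was removed."""
        return self._data.pop(key, None) is not None

store = KVStore()
store.set("session:abc", "vahid")
print(store.get("session:abc"))  # vahid
```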
 Wide, sparse column sets
 Schema-light
 Examples:
 Cassandra
 HBase
 BigTable
 GAE HR DS
Column Database
 Use for data that is
 document-oriented (collections of JSON documents) with semi-structured
data
 Encodings include XML, YAML, JSON & BSON
 binary forms
 (PDF, Microsoft Office documents: Word, Excel, …)
 Examples: MongoDB, CouchDB
Document Database
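The document model can be sketched with plain Python: each record is a schema-less JSON document, and documents in one collection need not share fields. The `find` helper below is an invented stand-in that matches by field equality, similar in spirit to (but much simpler than) MongoDB's `find()`.

```python
# Sketch of a document collection: schema-less JSON documents
# queried by field value. Names and fields are made up for illustration.
import json

collection = [
    {"_id": 1, "name": "Ada", "tags": ["db", "nosql"]},
    {"_id": 2, "name": "Lin", "city": "Tehran"},  # different fields: schema-less
]

def find(coll, **criteria):
    """Return documents whose fields equal all the given criteria."""
    return [doc for doc in coll if all(doc.get(k) == v for k, v in criteria.items())]

print(json.dumps(find(collection, name="Lin")))
```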
Graph Database
Use for data with
 a lot of many-to-many relationships
 when your primary objective is quickly
finding connections, patterns and
relationships between the objects within
lots of data
 Examples: Neo4J, FreeBase (Google)
So which type of NoSQL? Back to CAP…
CP = noSQL/column
Hadoop
Big Table
HBase
MemCacheDB
AP = noSQL/document or key/value
DynamoDB
CouchDB
Cassandra
Voldemort
CA = SQL/RDBMS
SQL Server / SQL Azure
Oracle
MySQL
Apache Hadoop Projects
Apache Hadoop
 A framework for storing & processing petabytes of data using commodity hardware
and storage
 Apache project
 Implemented in Java
 Community of contributors is growing
 Yahoo: HDFS and MapReduce
 Powerset: HBase
 Facebook: Hive and FairShare scheduler
 IBM: Eclipse plugins
Brief history of Hadoop
Organizations using Hadoop
Hadoop System Principles
 Scale-Out rather than Scale-Up
 Bring code to data rather than data to code
 Deal with failures – they are common
 Abstract complexity of distributed and concurrent applications
Scale-Out Instead of Scale-Up
 It is harder and more expensive to scale-up
 Add additional resources to an existing node (CPU, RAM)
 New units must be purchased if the required resources cannot be added
 Also known as scale vertically
 Scale-Out
 Add more nodes/machines to an existing distributed application
 Software Layer is designed for node additions or removal
 Hadoop takes this approach - A set of nodes are bonded together as a single
distributed system
 Very easy to scale down as well
Code to Data
 Traditional data processing architecture
 Nodes are broken up into separate processing and storage nodes connected by
high-capacity link
 Many data-intensive applications are not CPU-bound, causing bottlenecks in the
network
Code to Data
 Hadoop co-locates processors and storage
 Code is moved to data (size is tiny, usually in KBs)
 Processors execute code and access underlying local storage
Failures are Common
 Given a large number of machines, failures are common
 Large warehouses may see machine failures weekly or even daily
 Hadoop is designed to cope with node failures
 Data is replicated
 Tasks are retried
Abstract Complexity
 Hadoop abstracts many complexities in distributed and concurrent applications
 Defines small number of components
 Provides simple and well-defined interfaces for interaction between these components
 Frees developer from worrying about system level challenges
 processing pipelines, data partitioning, code distribution
 Allows developers to focus on application development and business logic
Distribution Vendors
 Cloudera Distribution for Hadoop (CDH)
 MapR Distribution
 Hortonworks Data Platform (HDP)
 Apache BigTop Distribution
Components
 Distributed File System
 HDFS
 Distributed Processing Framework
 Map/Reduce
The Storage:
Hadoop Distributed File System
HDFS is Good for...
 Storing large files
 Terabytes, Petabytes, etc...
 Millions rather than billions of files
 100MB or more per file
 Streaming data
 Write once and read-many times patterns
 Optimized for streaming reads rather than random reads
 “Cheap” Commodity Hardware
 No need for super-computers, use less reliable commodity hardware
HDFS Daemons
Files and Blocks
HDFS Component Communication
REPLICA MANAGEMENT
 A common practice is to spread the nodes across multiple racks
 A good replica placement policy should improve data reliability, availability,
and network bandwidth utilization
 Namenode determines replica placement
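The replica-placement ideas above can be sketched in a few lines: a file is split into fixed-size blocks, and each block gets multiple replicas spread across racks. This is a naive stand-in for the NameNode's real policy (which keeps one replica local and the others on a remote rack); the rack and node names are invented.

```python
# Sketch of HDFS-style block splitting and cross-rack replica placement.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default
RACKS = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}  # invented topology

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes (last block may be partial)."""
    return (file_size + block_size - 1) // block_size

def place_replicas(block_id, replication=3):
    """Naive placement: one replica on rack1, the rest on rack2."""
    nodes = RACKS["rack1"][:1] + RACKS["rack2"][:replication - 1]
    return {"block": block_id, "replicas": nodes}

print(split_into_blocks(300 * 1024 * 1024))  # 3 blocks for a 300 MB file
```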
NETWORK TOPOLOGY AND HADOOP
The Execution Engine:
Apache Yarn
Apache Yarn
Yarn Components
 ResourceManager:
 Arbitrates resources among all the applications in the
system
 NodeManager:
 The per-machine worker, responsible for launching
the applications’ containers and monitoring their resource
usage
 ApplicationMaster:
 Negotiates appropriate resource containers from the
Scheduler, tracks their status, and monitors progress
 Container:
 Unit of allocation incorporating resource elements such as
memory, CPU, disk, network, etc., to execute a specific task of the
application (similar to map/reduce slots in MRv1)
YARN Architecture
The Processing Model:
MapReduce
Hadoop Mapreduce Framework
What is MapReduce?
 Parallel programming model for large clusters
 User implements Map() and Reduce()
 Parallel computing framework
 Libraries take care of EVERYTHING else
 Parallelization
 Fault Tolerance
 Data Distribution
 Load Balancing
 MapReduce library does most of the hard work for us!
 Takes care of distributed processing and coordination
 Scheduling
 Task Localization with Data
 Error Handling
 Data Synchronization
MapReduce: Data Flow
Map and Reduce
 Map()
 Map workers read in contents of corresponding input partition
 Process a key/value pair to generate intermediate key/value pairs
 Reduce()
 Merge all intermediate values associated with the same key
 e.g. <key, [value1, value2, ..., valueN]>
 Output of user's reduce function is written to output file on global file system
 When all tasks have completed, master wakes up user program
Distributed Processing
 Word count on a huge file
Mapreduce Model
Example: Counting Words
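The word count flow above can be sketched as a single-process simulation: map emits (word, 1) pairs, a shuffle step groups values by key, and reduce sums each group. In real Hadoop these phases run on many machines with the framework handling the shuffle; this sketch only shows the programming model.

```python
# Single-process sketch of MapReduce word count: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(text):
    """Map: emit an intermediate (word, 1) pair per word."""
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase("big data big compute")))
print(counts["big"])  # 2
```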
