OCTO © 2018 - Reproduction prohibited without prior written permission
Hadoop
Remember, you asked for it
01 Distributed systems concepts
02 Hadoop genesis
03 HDFS
04 MapReduce
05 YARN
06 Ecosystem
07 Architecture examples
THERE IS A BETTER WAY
01 DISTRIBUTED SYSTEMS CONCEPTS
DISTRIBUTED SYSTEMS
A distributed system is a system whose components are located on different networked
computers, which then communicate and coordinate their actions by passing messages to
each other.
CONCEPTS
- CAP theorem?
- PACELC theorem?
- Partitioning
  - Shard the data over multiple nodes according to a partition key, to spread the load when reading and writing data
- Replication
  - Copy the data to several different nodes
- Durability vs availability
  - Durability is long-term data protection: if the power goes out, what happens to the data?
  - Availability is the ability to deliver the data: during a network outage, can you still serve it?
- Concurrency vs parallelism
  - Concurrency is the composition of independently executing processes (Go)
  - Parallelism is the simultaneous execution of (possibly related) computations (Spark)
- Yield and harvest: UX metrics
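The partitioning and replication bullets can be illustrated with a minimal sketch. It assumes a simple hash-modulo partitioner and a "next nodes in the list" replica placement; the node names, the `partition_for` and `replica_nodes` helpers and the replication factor of 3 are hypothetical choices for illustration, not something the slides prescribe.

```python
import hashlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replica_nodes(partition: int, nodes: List[str], replication_factor: int) -> List[str]:
    """Place a partition on distinct nodes, walking the node list in order."""
    count = min(replication_factor, len(nodes))
    return [nodes[(partition + i) % len(nodes)] for i in range(count)]


# Hypothetical four-node cluster, one partition per node, three copies of each key.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement: Dict[str, List[str]] = {
    key: replica_nodes(partition_for(key, len(nodes)), nodes, 3)
    for key in ["user:1", "user:2", "user:3"]
}
```

Each key always hashes to the same partition, so reads and writes for that key hit the same small set of nodes, while different keys spread across the cluster. Losing one node still leaves two copies of every key it held.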
02 HADOOP GENESIS
What is Hadoop?
It’s a framework for distributed storage and processing of data, theoretically capable of
scaling to thousands of nodes
What is a data lake?
A data lake is a scalable, evolvable platform that stores many kinds of data.
That data undergoes value-adding processing so that it can be exposed to
every business line of the enterprise.
How was it created?
Why Hadoop?
- Web giants are accumulating data
- Data = value
- We need to store it, and there is a large volume of it
- Database technologies are not a viable solution, especially given the variety of the data
- We need to be able to process it at an acceptable speed (velocity)
[Figure: chart with "Data" and "Time" axes, scaled from "Little" to "Lots", with "Hadoop" marked on it]
Everything on Hadoop is designed to be:
- Durable
- Fault tolerant
- Resilient
- Distributed
"Hardware eventually fails. Software eventually works."
Michael Hartung
HDFS Characteristics
Characteristic | Description
Hierarchical | Directories containing files are arranged in a series of parent-child relationships.
Distributed | File system storage spans multiple drives and hosts.
Replicated | The file system automatically maintains multiple copies of data blocks.
Write-once, read-many optimized | The file system is designed to write data once but read the data multiple times.
Sequential access | The file system is designed for large sequential writes and reads.
Multiple readers | Multiple HDFS clients may read data at the same time.
Single writer | To protect file system integrity, only a single writer at a time is allowed.
Append-only | Files may be appended to, but existing data cannot be updated.
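To make the "Replicated" row concrete, here is a small arithmetic sketch of how a file turns into blocks and block replicas. The 128 MB block size and the replication factor of 3 are assumptions (they are common HDFS defaults, but the slides do not state them), and `hdfs_footprint` is a hypothetical helper name.

```python
import math

BLOCK_SIZE_MB = 128     # assumed: common HDFS default block size
REPLICATION_FACTOR = 3  # assumed: common HDFS default replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION_FACTOR):
    """Estimate block count, replica count and raw storage for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The last block only occupies the bytes it holds, so raw usage is size x replication.
        "raw_storage_mb": file_size_mb * replication,
    }


footprint = hdfs_footprint(1000)
```

Under these assumptions a 1,000 MB file is cut into 8 blocks, stored as 24 block replicas, and consumes about 3,000 MB of raw disk across the cluster.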
03 HDFS
HDFS Architecture
HDFS Features
- Master/slave architecture
- High availability
- Replication
- Quotas
- Heterogeneous storage (SSD, HDD, RAM disk)
- Snapshots
- Rack awareness
- ACLs/access masks
- Node rebalancing
- WebHDFS
- Filesystem checks
- Centralised cache
- Erasure coding
Hadoop pros and cons
- Pros
  - HDFS and YARN are very well integrated
  - A good fit when on-premise hosting is a requirement
  - Highly customisable
  - Faster writes
  - Move operations are just renames
  - Data locality (AWS S3 has no NameNode: it does not point to a data location but streams the data)
  - Data integrity (eventual consistency of S3 vs atomicity of operations)
- Cons
  - Cloud storage services are managed
  - Cloud storage services are elastic (pay-as-you-go model)
  - Container management platforms are popular
  - Master/slave architecture
  - Cost
  - …
04 MapReduce
Make a sandwich in MapReduce
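The sandwich figure from this slide is not recoverable from the text, so the following is only a hedged stand-in: a single-process sketch of the map, shuffle and reduce phases that counts ingredients across a few orders. The function names and sample records are hypothetical, and on a real Hadoop cluster each phase would run distributed over many nodes rather than in one Python process.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (ingredient, 1) pair per word in the input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run the three phases in sequence over an in-memory list of records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["bread tomato", "bread cheese", "tomato bread"])
```

The point of the split is that `map_phase` and `reduce_phase` only ever see one record or one key at a time, which is what lets a framework run many copies of them in parallel.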
Hadoop MR vs Spark
05 YARN
YARN ARCHITECTURE
CLUSTER BIG PICTURE
[Diagram: cluster layout]
- master node: NameNode, ResourceManager, ZooKeeper, History, …
- utility node: Knox Gateway, Ambari, …
- admin backup: additional and backup components for master and utility node…
- worker nodes 1 to 10: NodeManager and DataNode on each
- Aggregate pool of resources: 1,280 GB RAM
YARN Component responsibilities
- ResourceManager
  - Schedule global resources
  - Enable multitenancy
  - Enable SLA enforcement
  - Monitor and manage NodeManagers
  - Monitor and manage ApplicationMasters
  - Monitor containers globally
  - Manage ACLs
  - Manage tokens
- NodeManager
  - Manage local memory and CPU allocation
  - Track and report on node health
  - Manage file localization for containers
  - Monitor and manage local containers
- Container
  - Allocated RAM and CPU cores by NodeManager
  - Run ApplicationMasters and job tasks
- ApplicationMaster
  - YARN application bootstrap process
  - Negotiate resources
  - Provide application fault tolerance
  - Work with NodeManager for container restart
  - Monitor job tasks and containers across cluster
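The division of labour between these components can be sketched as a toy model. This is not the YARN API: the classes, the "most free memory first" placement policy and the figure of 128 GB per worker (1,280 GB divided evenly over ten workers) are all assumptions made for illustration.

```python
class NodeManager:
    """Tracks memory on a single worker node (toy model, memory only)."""

    def __init__(self, name, total_gb):
        self.name = name
        self.free_gb = total_gb

    def launch_container(self, memory_gb):
        """Reserve local memory and return a description of the container."""
        self.free_gb -= memory_gb
        return {"node": self.name, "memory_gb": memory_gb}


class ResourceManager:
    """Holds the global view of the cluster and picks a node per request."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def total_free_gb(self):
        return sum(nm.free_gb for nm in self.node_managers)

    def allocate(self, memory_gb):
        # Hypothetical policy: choose the node with the most free memory.
        candidate = max(self.node_managers, key=lambda nm: nm.free_gb)
        if candidate.free_gb < memory_gb:
            return None  # no single node can host this container
        return candidate.launch_container(memory_gb)


def application_master(rm, containers_needed, memory_gb):
    """Negotiate containers with the ResourceManager for one application."""
    granted = []
    for _ in range(containers_needed):
        container = rm.allocate(memory_gb)
        if container is None:
            break
        granted.append(container)
    return granted


# Assumed cluster: ten workers with 128 GB each, i.e. a 1,280 GB pool.
workers = [NodeManager(f"worker-{i}", 128) for i in range(1, 11)]
rm = ResourceManager(workers)
pool_before = rm.total_free_gb()
containers = application_master(rm, containers_needed=5, memory_gb=64)
pool_after = rm.total_free_gb()
```

Note that the pool is aggregate but not fungible: a 256 GB container cannot be placed in this model even though 1,280 GB is free overall, because no single node has that much memory.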
YARN Features
- Queues
- Priority
- Preemption of resources
- ACLs
- User limits
- Log aggregation
- Container placement
- High availability
- Heterogeneous workloads
- Node labels
- Schedulers: FairScheduler, CapacityScheduler, custom
- Stateless and stateful
YARN vs the world
[Figure: scheduler approaches compared, with these captions]
- "I got a container, place it on a node"
- "I need this much" / "Okay, put it there"
- "Cluster state stored at app level"
06 ECOSYSTEM
Platforms
- The big three
  - Hortonworks + IBM BigInsights (gone)
  - Cloudera
  - MapR
- And the others (not exhaustive)
  - Pivotal
  - Microsoft
  - Teradata HD (MPP)
  - DataStax Enterprise Analytics
  - Dremio
- Cloud
  - AWS EMR
  - GCP Dataflow (implementation of Apache Beam)
  - GCP
  - Azure HDInsight
Hortonworks
NEED. MORE. TOOLS.
- Resource management
  - YARN
  - Mesos
  - OpenShift
  - Kubernetes
  - Nomad
  - Titus
- NoSQL, including time-series databases
  - Druid
  - Cassandra
  - HBase
- Graph databases
  - JanusGraph
  - Neo4j
- Document store
  - AWS DynamoDB
  - MongoDB
  - Couchbase
- Distributed storage
  - HDFS
  - AWS S3
  - Azure Storage
  - GCP Cloud Storage
  - Ceph
- Monitoring
  - Ganglia
  - Nagios
  - Prometheus
  - Datadog
  - Ambari
- Security
  - Kerberos
- Access
  - Ranger
  - Sentry
- SQL
  - Hive
  - Impala
  - Drill
  - Google BigQuery
  - AWS Athena
- UI
  - Hue
  - Ambari
  - Zeppelin
  - Jupyter
- Search
  - Solr
  - Elasticsearch
  - Algolia
- Log management
  - Logstash
  - Flume
  - Fluentd
  - AWS CloudWatch
- Machine (deep) learning
  - TensorFlow
  - Caffe
  - MXNet
  - Spark ML
- Streaming/batch processing
  - Spark
  - Flink
  - Apex
  - Kafka Streams
- Messaging
  - Kafka
  - RabbitMQ
- Governance
  - Atlas
  - Spline
  - Falcon
07 ARCHITECTURE EXAMPLES
Other examples
- Cassandra
  - Token ring: a token (hash) is computed from the key, the data is sent to the node that owns that token, and replicas go to other nodes in the ring
  - The coordinator keeps track of which node gets which range of keys
  - A gossip protocol lets the nodes learn who holds the data
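The token ring idea can be sketched in a few lines. This is an illustration of the technique only, not Cassandra's implementation: it assumes one token per node, uses MD5 as a stand-in hash, and places replicas on the next nodes clockwise. The `TokenRing` class and node names are hypothetical, and gossip is left out.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string onto the ring (MD5 used here as a stand-in hash)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Toy ring: each node owns the key range ending at its own token."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas_for(self, partition_key):
        """Return the owner of the key's token plus the next nodes clockwise."""
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


ring = TokenRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas_for("user:42")
```

Because placement depends only on the hash and the sorted tokens, any node acting as coordinator can compute the same replica list for a key without asking a central master.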
Consensus algorithm
"If computers get too powerful, we can organize them into committees. That'll do them in."
Steve Wozniak
References
- https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704
- https://medium.com/@arseny.chernov/nomad-vs-yarn-vs-kubernetes-vs-borg-vs-mesos-vs-you-name-it-7f15a907ece2
- http://firmament.io/blog/scheduler-architectures.html
- https://codahale.com/you-cant-sacrifice-partition-tolerance/
Hadoop Technical Presentation

Editor's Notes

  • #21: Containers are allocated CPU, RAM and disk. The Spark driver runs inside the YARN ApplicationMaster; the executors run in containers.
  • #25: Stateful apps are supported. Different levels of scheduling.