Hadoop Technical Presentation

OCTO © 2018 - Reproduction prohibited without prior written permission
Hadoop
Remember, you asked for it

01 Distributed systems concepts
02 Hadoop genesis
03 HDFS
04 MapReduce
05 YARN
06 Ecosystem
07 Architecture examples

THERE IS A BETTER WAY
01 DISTRIBUTED SYSTEMS CONCEPTS

DISTRIBUTED SYSTEMS
A distributed system is a system whose components are located on different networked
computers, which then communicate and coordinate their actions by passing messages to
each other.

CONCEPTS
 CAP Theorem?
 PACELC Theorem?
 Partitioning
< Shard the data over multiple nodes according to a partition key, to spread the load when reading and writing data
 Replication
< Copy the data to different nodes
 Durability vs availability
< Durability is long-term data protection: if the power goes out, what happens to the data?
< Availability is the ability to deliver the data: during a network outage, can you still serve it?
 Concurrency vs parallelism
< Concurrency is the composition of independently executing processes (Go)
< Parallelism is the simultaneous execution of (possibly related) computations (Spark)
 Yield and Harvest: UX metrics
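
Partitioning and replication can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the MD5-based hash and the "next nodes in order" replica placement are hypothetical choices, not something the slides prescribe.

```python
import hashlib
from typing import List


def partition_for(key: str, num_nodes: int) -> int:
    """Map a partition key to a node index using a stable hash (assumed: MD5)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def replica_nodes(key: str, num_nodes: int, replication_factor: int) -> List[int]:
    """Return the primary node followed by the nodes holding the copies.

    Assumed placement: copies go to the next nodes in index order,
    wrapping around at the end of the cluster.
    """
    if replication_factor > num_nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")
    primary = partition_for(key, num_nodes)
    return [(primary + i) % num_nodes for i in range(replication_factor)]


if __name__ == "__main__":
    # Same key, same nodes on every call: reads know where to look.
    print(replica_nodes("user-42", num_nodes=5, replication_factor=3))
```

The partition key spreads reads and writes across nodes, and the replica list is what keeps the data available when one of those nodes is lost.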

02 HADOOP GENESIS

What is Hadoop?
It’s a framework for distributed storage and processing of data, theoretically capable of
scaling to thousands of nodes

What is a data lake?
A data lake is a scalable, evolvable platform that stores multiple
kinds of data. The data in it undergoes value-adding processing so
that it can be exposed to all business lines of the enterprise.

How was it created?

 Web giants are accumulating data
 Data = value
 We need to store it, and there is a large volume of it
 Database technologies are not a viable solution, especially given the variety of the data
 We need to be able to process it at an acceptable speed (velocity)
Why Hadoop?
[Figure: chart plotting Data against Time, with a scale from Little to Lots and a region labelled Hadoop]
Everything on Hadoop is designed to be:
< Durable
< Fault tolerant
< Resilient
< Distributed
“Hardware eventually fails. Software eventually works.”
Michael Hartung

HDFS Characteristics
Characteristic | Description
Hierarchical | Directories containing files are arranged in a series of parent-child relationships.
Distributed | File system storage spans multiple drives and hosts.
Replicated | The file system automatically maintains multiple copies of data blocks.
Write-once, read-many optimized | The file system is designed to write data once but read the data multiple times.
Sequential access | The file system is designed for large sequential writes and reads.
Multiple readers | Multiple HDFS clients may read data at the same time.
Single writer | To protect file system integrity, only a single writer at a time is allowed.
Append-only | Files may be appended to, but existing data cannot be updated.
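
The "Replicated" row can be made concrete with some arithmetic. The sketch below assumes the stock Hadoop defaults of 128 MB blocks (dfs.blocksize) and 3 copies (dfs.replication); a given cluster may be configured differently, and the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed: default dfs.blocksize
REPLICATION = 3      # assumed: default dfs.replication


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB used) for one file.

    The last block only occupies the bytes it actually holds,
    so raw usage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    raw_mb = file_size_mb * replication
    return blocks, block_replicas, raw_mb


if __name__ == "__main__":
    # A 1,000 MB file: 8 blocks, 24 block replicas, 3,000 MB of raw disk.
    print(hdfs_footprint(1000))
```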

03 HDFS

HDFS Architecture

HDFS Features
 Master/Slave architecture
 High availability
 Replication
 Quotas
 Heterogeneous storage (SSD, HDD, RAM disk)
 Snapshotting
 Rack awareness
 ACLs/Access masks
 Node rebalancing
 WebHDFS
 Filesystem checks
 Centralised cache
 Erasure coding

Hadoop pros and cons
 Pros
< HDFS and YARN are very well integrated
< Suitable when on-premise deployment is a requirement
< Highly customisable
< Faster writes
< Move operations are just renames
< Data locality (AWS S3 has no NameNode: it does not point to a data location but streams the data)
< Data integrity (eventual consistency of S3 vs. atomicity of HDFS operations)
 Cons
< Cloud storage services are managed
< Cloud storage services are elastic (pay-as-you-go model)
< Container management platforms are popular
< Master/Slave architecture
< Cost
< …

04 MapReduce

Make a sandwich in MapReduce
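
The slide's picture is not part of this text, so the following is an assumed stand-in: a minimal in-memory sketch of the map, shuffle and reduce phases that counts ingredients across sandwich orders. The function names and the sample orders are hypothetical, and nothing here uses the Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(order):
    """Map: emit one (ingredient, 1) pair per ingredient in an order."""
    for ingredient in order.split():
        yield ingredient.lower(), 1


def shuffle_phase(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single total."""
    return key, sum(values)


def run_job(orders):
    """Run the three phases in sequence on a list of order strings."""
    mapped = chain.from_iterable(map_phase(order) for order in orders)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["bread ham cheese", "bread tomato cheese"]))
    # {'bread': 2, 'ham': 1, 'cheese': 2, 'tomato': 1}
```

On a real cluster the map calls run in parallel next to the data blocks and the shuffle moves data over the network; here everything happens in one process to keep the shape of the computation visible.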

Hadoop MR vs Spark

05 YARN

YARN ARCHITECTURE

CLUSTER BIG PICTURE
[Diagram: example cluster layout]
 Master node: NameNode, ResourceManager, ZooKeeper, History, …
 Utility node: Knox Gateway, Ambari, …
 Admin backup: additional and backup components for master and utility node…
 Worker nodes 1 to 10: a NodeManager and a DataNode on each
 Aggregate pool of resources: 1,280 GB RAM
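
The aggregate figure can be reproduced with simple arithmetic. The slide only gives the total, so the sketch below assumes ten identical workers with 128 GB each, and it ignores memory reserved for the operating system and daemons; the function names are hypothetical.

```python
def aggregate_memory_gb(worker_count, memory_per_node_gb):
    """Total memory pooled across all worker nodes."""
    return worker_count * memory_per_node_gb


def max_containers(worker_count, memory_per_node_gb, container_gb):
    """Upper bound on containers of one size that fit in the cluster.

    A container cannot span nodes, so the division is done per node
    before multiplying by the number of workers.
    """
    return worker_count * (memory_per_node_gb // container_gb)


if __name__ == "__main__":
    print(aggregate_memory_gb(10, 128))  # 1280
    print(max_containers(10, 128, 4))    # 320
    print(max_containers(10, 128, 48))   # 20, although 1280 // 48 is 26
```

The last line shows why the pool is not perfectly fungible: large containers leave unusable remainders on every node.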

YARN Component responsibilities
 ResourceManager
< Schedule global resources
< Enable multitenancy
< Enable SLA enforcement
< Monitor and manage NodeManagers
< Monitor and manage ApplicationMasters
< Monitor containers globally
< Manage ACLs
< Manage Tokens
 NodeManager
< Manage local memory and CPU allocation
< Track and report on node health
< Manage file localization for containers
< Monitor and manage local containers
 Container
< Allocated RAM and CPU cores by NodeManager
< Run ApplicationMasters and job tasks
 ApplicationMaster
< YARN application bootstrap process
< Negotiate resources
< Provide application fault tolerance
< Work with NodeManager for container restart
< Monitor job tasks and containers across cluster

YARN Features
 Queues
 Priority
 Preemption of resources
 ACL
 User limits
 Log aggregation
 Container placement
 High availability
 Heterogeneous workloads
 Node labels
 Fair Scheduler, Capacity Scheduler, or a custom scheduler
 Stateless and stateful

YARN vs the world
I got a container, place it on a node
- I need this much
- Okay, put it there
Cluster state stored at app level

06 ECOSYSTEM

Platforms
 The big three
< Hortonworks + IBM BigInsights (Gone)
< Cloudera
< MapR
 And the others (not exhaustive)
< Pivotal
< Microsoft
< Teradata HD (MPP)
< DataStax Enterprise Analytics
< Dremio
 Cloud
< AWS EMR
< GCP Dataflow (implementation of Apache Beam)
< GCP
< Azure HDInsight

Hortonworks

NEED. MORE. TOOLS.
 Resource Management
< YARN
< Mesos
< OpenShift
< Kubernetes
< Nomad
< Titus
 NoSQL including TS Databases
< Druid
< Cassandra
< HBase
 Graph databases
< JanusGraph
< Neo4j
 Document store
< AWS DynamoDB
< MongoDB
< Couchbase
 Distributed Storage
< HDFS
< AWS S3
< Azure Storage
< GCP Cloud Storage
< Ceph
 Monitoring
< Ganglia
< Nagios
< Prometheus
< Datadog
< Ambari
 Security
< Kerberos
 Access
< Ranger
< Sentry
 SQL
< Hive
< Impala
< Drill
< Google BigQuery
< AWS Athena
 UI
< Hue
< Ambari
< Zeppelin
< Jupyter
 Search
< Solr
< Elasticsearch
< Algolia
 Log management
< Logstash
< Flume
< Fluentd
< AWS CloudWatch
 Machine (deep) learning
< TensorFlow
< Caffe
< MXNet
< Spark ML
 Streaming/Batch processing
< Spark
< Flink
< Apex
< Kafka Streams (KStreams)
 Messaging
< Kafka
< RabbitMQ
 Governance
< Atlas
< Spline
< Falcon

07 ARCHITECTURE EXAMPLES

Other examples
 Cassandra
< Token ring: a token (hash) is computed from the key, the data is sent to a node, and replicas are sent to other nodes in the ring
< The coordinator keeps track of which node gets which range of keys
< Gossip protocol to know which node has the data
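
The token ring idea can be sketched as follows. This is a simplification under stated assumptions, not Cassandra's implementation: one token per node (no virtual nodes), an MD5-based token, and replicas placed on the next nodes clockwise. The TokenRing class and the node names are hypothetical.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: each node owns the key range ending at its token."""

    def __init__(self, nodes, replication_factor=3):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed the number of nodes")
        self.replication_factor = replication_factor
        # Sort nodes by token so the list can be walked clockwise.
        self._ring = sorted((self._token(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value):
        """Hash a string to an integer token (assumed hash: MD5)."""
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def replicas(self, key):
        """Return the owner of the key followed by its replica nodes."""
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self._tokens, self._token(key)) % len(self._ring)
        size = len(self._ring)
        return [self._ring[(start + i) % size][1] for i in range(self.replication_factor)]


if __name__ == "__main__":
    ring = TokenRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
    print(ring.replicas("customer:1001"))
```

Because placement depends only on the hash and the sorted tokens, any node that knows the ring membership can act as coordinator and compute where a key lives.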

Consensus algorithm
“If computers get too powerful, we can organize them into committees. That'll do them in.”
Steve Wozniak

References
 https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704
 https://medium.com/@arseny.chernov/nomad-vs-yarn-vs-kubernetes-vs-borg-vs-mesos-vs-you-name-it-7f15a907ece2
 http://firmament.io/blog/scheduler-architectures.html
 https://codahale.com/you-cant-sacrifice-partition-tolerance/
