PILOT HADOOP TOWARDS 2500 NODES AND CLUSTER REDUNDANCY
Stuart Pook (s.pook@criteo.com @StuartPook)
BROUGHT TO YOU BY LAKE
Anna Savarin, Anthony Rabier, Meriam Lachkar, Nicolas Fraison,
Rémy Saissy, Stuart Pook, Thierry Lefort & Yohan Bismuth
CRITEO
Online advertising
Target the right user
At the right time
With the right message
SIMPLIFIED BUSINESS MODEL
We buy
Ad spaces
We sell
Clicks — that convert — a lot
We take the risk
HADOOP AT CRITEO BACK IN 2014 (AM5)
1200 nodes
39 PB raw capacity
> 100 000 jobs/day
10 000 CPUs
HPE ProLiant Gen8
105 TB RAM
40 TB imported/day
Cloudera CDH4
TODAY'S PRIMARY DATA INPUT
Kafka
500 billion events per day
up to 4.3 million events/second
JSON → protobuf
72 hour buffers
PAID
COMPUTE AND DATA ARE ESSENTIAL
Extract, Transform & Load logs
Bidding models
Billing
Business analysis
HADOOP PROVIDES LOCAL REDUNDANCY
Failing datanodes (1 or 2)
Failing racks (1)
Failing namenodes (1)
Failing resourcemanager (1)
NO PROTECTION AGAINST
Data centre disaster
Multiple datanode failures in a short time
Multiple rack failures in a day
Operator error
DATA BACKUP IN THE CLOUD
Backups take a long time
Import 100 TB/day
Create 80 TB/day
Backup at 50 Gb/s?
Restore 2 PB too long at 50 Gb/s
What about compute?
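As a rough check on "too long" (assuming 2 PB ≈ 1.6 × 10^16 bits): at a sustained 50 Gb/s, a full restore takes about 1.6 × 10^16 / (5 × 10^10) ≈ 3.2 × 10^5 seconds, close to four days, before any compute capacity is usable again.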
COMPUTE IN THE CLOUD
20 000 CPUs require a reservation
Reservations are expensive
No need for elasticity (batch processing)
Criteo has data centres
Criteo likes bare metal
Cloud > 8 times more expensive
In-house gets us exactly the network & hardware we need
BUILD NEW DATA CENTRE (PA4) IN 9 MONTHS
Space for 5000 machines
Non-blocking network
10 Gb/s endpoints
Level 3 routing
Clos topology
Power 1 megawatt + option for 1 MW
It's impossible
NEW HARDWARE
Had one supplier
Need competition to keep prices down
3 replies to our call for tenders
3 similar 2U machines
16 (or 12) 6 TB SATA disks
2 Xeon E5-2650L v3, 24 cores, 48 threads
256 GB RAM
Mellanox 10 Gb/s network card
2 different RAID cards
TEST THE HARDWARE
Three 10-node clusters
Disk bandwidth
Disk errors?
Network bandwidth
Teragen
Zero replication (disks)
High replication (network)
Terasort
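The teragen/terasort runs above might look roughly like this; a sketch only, where the examples-jar path, data sizes and replication factors are assumptions rather than the exact benchmark commands used:

    # ~1 TB of teragen data = 10^10 rows of 100 bytes each
    EXAMPLES=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

    # replication 1: every block stays on the writing node, so this measures the disks
    hadoop jar "$EXAMPLES" teragen -D dfs.replication=1 10000000000 /benchmarks/teragen-disks

    # high replication (e.g. one copy per node of a 10-node cluster): this measures the network
    hadoop jar "$EXAMPLES" teragen -D dfs.replication=10 10000000000 /benchmarks/teragen-network

    # terasort over the generated data
    hadoop jar "$EXAMPLES" terasort /benchmarks/teragen-disks /benchmarks/terasort-out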
HARDWARE IS SIMILAR
Eliminated the manufacturer with
4 DOA disks
Other failed disks
20% higher power consumption
Choose the cheapest and most dense
Huawei
LSI-3008 RAID card
MIX THE HARDWARE
Operations are more difficult
Multiple configurations needed
Some clusters have both hardware types
We have more choice at each order
Avoid vendor lock-in
HAVE THE DC, BUILD HADOOP
Configure using Chef
Infrastructure is code in git
Automate to scale (because the team doesn't)
Test Hadoop with 10-hour petasorts
Tune Hadoop for this scale
Namenode machine crashes
Upgrade kernel
Rolling restart on all machines
Rack by rack, not node by node
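A rack-by-rack rolling restart can be choreographed with a small loop. This is a hypothetical sketch: the racks.txt inventory, the rack_hosts helper and passwordless ssh are placeholders, not Criteo's actual tooling.

    # reboot one rack at a time, then wait for HDFS to settle before the next one
    while read -r rack; do
      for host in $(rack_hosts "$rack"); do        # rack_hosts: placeholder inventory lookup
        ssh "$host" 'sudo shutdown -r now' || echo "reboot failed on $host" >&2
      done
      # block until the namenode no longer reports under-replicated blocks
      until hdfs dfsadmin -report | grep -q 'Under replicated blocks: 0'; do
        sleep 60
      done
    done < racks.txt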
A MISTAKE
Made master and datanodes the same
Just one type of machine
Many available nodes if failure
Moving services with Kerberos is hard
Move master nodes to bigger machines
Offload DNS & KDC
FASTER MASTER NODES
Namenode does many sequential operations
Long locks
Failovers too slow
Heartbeats lost
→ fast CPU
Two big namenodes
512 GB RAM
2 × Intel Xeon CPU E5-2643 v4 @ 3.40GHz
3 × RAID 1 of 2 SSD
PA4 CLUSTER ONLINE
More capacity as we have 2 clusters
Users love it
Users find new uses
Soon using more capacity than the new cluster provides
Impossible to stop the old cluster
SITUATION NOW WORSE
Two critical clusters
Two Hadoop versions to support: CDH4 & CDH5
Not one but two SPOFs
GROW THE NEW DATA CENTRE
Add hundreds of new nodes
Soon the new cluster will be big enough
1 370 datanodes
+ 650 in Q3 2017
+ 900 in Q4 2017
~ 3 000 datanodes end 2017
Too many blocks for the namenode?
ONE SPOF IS ENOUGH
Move all jobs and data to the new cluster (PA4)
Stop the old cluster (AM5)
Only one SPOF but still no redundancy
Have file backups on different technology
2018: BUILD ANOTHER CLUSTER
Human users (development, machine-learning, BI)
QA & non-regression for service jobs
All data for service jobs
➡ PA4 backup for service jobs
But RAM is expensive
ANOTHER ROUND OF TENDERS
Need more CPU
Denser machines
4U, 8 nodes, 16 CPUs, 4 × 8 × 2.5" disks (8/U)
2U, 4 nodes, 8 CPUs, 6 × 4 × 2.5" disks (12/U)
Infrastructure validation
Hadoop tests and benchmarks
THE RAID CONTROLLER STORY
FIRST RAID CONTROLLER
Historically HPE Smart Array P420 (PMC-Sierra)
Only RAID 0 used at Criteo
OS status = RAID card status
Skip volumes flagged as bad by the RAID card
Very rare cases of fsck failures
Filesystems mounted with errors=remount-ro
Very rare cases of unflagged read-only filesystems
Access to the volumes blocked
Assumed to be standard behaviour
Operations need RAID card error flag
Out-of-band status → Jira ticket
Identification LED → disk swap
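For reference, the errors=remount-ro behaviour above comes from the mount options of each data volume; a representative (not Criteo-specific) /etc/fstab entry would be:

    # one single-disk RAID 0 data volume; device, mount point and ext4 are illustrative
    /dev/sdb1  /data/1  ext4  defaults,noatime,errors=remount-ro  0  2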
SECOND RAID CONTROLLER: LSI-3008
Disks vanished
No diagnostic LED
Used ID LED on other disks
Later tested OK
Worked after power cycle
Change 700 cards on a running cluster
All blocks lost on each card change
No downtime allowed
Rack by rack
Many HDFS stability problems
THIRD RAID CONTROLLER: LSI-3108
LSI RAID card
Now the OS flags bad disks before the card does
Failing fsck
Read-only filesystems
Volume seen as OK by the RAID card
OS can access the volume but gets timeouts
No error for out of band monitoring
No error for in-band monitoring
We can only handle OK or Failed volumes
2 SOLUTIONS WITH VENDOR LOCK-IN
Buy all machines from HPE
Get the supplier to “solve” the problem for us
Agent running in the controller
They develop in China
We debug in France (in prod)
2 COMPLICATED SOLUTIONS
Create RAID0 team to
Handle all error conditions
Stop access to the volume
Reformat (once) volume with read errors
Open tickets
Set identification LED
Work with LSI to tweak their controller
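A minimal sketch of the detection side of such a tool, assuming data volumes are mounted under /data and with the ticketing and LED steps left as placeholders:

    # flag any data volume the kernel has remounted read-only
    while read -r dev mnt fstype opts _; do
      case "$mnt" in /data/*) ;; *) continue ;; esac
      if printf '%s\n' "$opts" | grep -qE '(^|,)ro(,|$)'; then
        echo "ALERT: $mnt ($dev) is read-only"   # placeholder: open a ticket, light the ID LED
      fi
    done < /proc/mounts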
STAYING ALIVE
Automate operations
Disk changes
Ramp ups
Infrastructure as code
Tests: kitchen, ChefSpec & preprod
Merge requests with reviews
Choreograph operations
Restarts (machines or services)
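In practice the pre-merge tests above reduce to a couple of commands per cookbook; the suite name here is a placeholder:

    rspec                    # ChefSpec unit tests for the cookbook
    kitchen test datanode    # converge and verify the "datanode" suite in a throwaway instance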
NEED LAKE CLUSTER
Infrastructure validation
Test configuration recipes
Test Hadoop patches
And lab for hardware tests
STAYING ALIVE — TUNE HADOOP
Increase bandwidth and time limit for checkpoint
332 GB heap for namenode
180 GB reserved for native code & OS
Tune GC for namenode
Serial → not efficient on multi-thread
Parallel → long pauses + high throughput
Concurrent Mark Sweep → short pauses + lower throughput
G1 → in prod
G1NewSizePercent=0
G1RSetRegionEntries=4096
+ParallelRefProcEnabled & +PerfDisableSharedMem
Azul not required for 1300 datanodes, but for 3000?
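Put together, the namenode JVM settings above could be expressed in hadoop-env.sh roughly as follows; the heap size and G1 flags are those listed above, while the exact layout (and the UnlockExperimentalVMOptions flag that some of these G1 flags require) is an assumption:

    export HADOOP_NAMENODE_OPTS="-Xms332g -Xmx332g \
      -XX:+UseG1GC \
      -XX:+UnlockExperimentalVMOptions \
      -XX:G1NewSizePercent=0 \
      -XX:G1RSetRegionEntries=4096 \
      -XX:+ParallelRefProcEnabled \
      -XX:+PerfDisableSharedMem \
      $HADOOP_NAMENODE_OPTS"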
STAYING ALIVE — FIX BUGS
The cluster crashes; find the bug; if already fixed upstream, backport it, else fix it ourselves
Fix
HDFS-10220 Expired leases make namenode unresponsive and cause failover
Backport
YARN-4041 Slow delegation token renewal prolongs RM recovery
HDFS-9305 Delayed heartbeat processing causes storm of heartbeats
YARN-4546 ResourceManager crash due to scheduling opportunity overflow
HDFS-9906 Remove spammy log spew when a datanode is restarted
STAYING ALIVE — MONITORING
HDFS
Namenode: missing blocks, GC, checkpoints, safemode, QPS, live datanodes
Datanodes: disks, read/write throughput, space
YARN
Queue length, memory & CPU usage, job duration (scheduling + run time)
ResourceManager: QPS
Bad nodes
Probes to emulate client behavior with witness jobs
Zookeeper: availability, probes
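Several of the HDFS checks above map directly onto standard commands; a simplified sketch (the exact report wording varies between Hadoop versions):

    hdfs dfsadmin -safemode get       # expect "Safe mode is OFF"
    hdfs dfsadmin -report | grep -iE 'missing blocks|under replicated|datanodes'
    hdfs fsck / -list-corruptfileblocks | tail -n 1   # summary of corrupt/missing files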
CLUSTER COLLAPSE
Lots of blocks → 132 GB namenode (NN) heap full
User creates 20 million files & 20 PB data on a Friday afternoon
NN gets stuck doing GC → no throughput
Increase standby heap size to 85% RAM via restart
Too many requests during restart (iptables)
Failover crashed
Fsimage on the active NN corrupt as it was too big to transfer
Copy missing NN edits from journal node
Restart 1200 datanodes in batches
36 hours to recover the cluster
RESOURCEMANAGER SLOWS
Event EventType: KILL_CONTAINER sent to absent container
These messages happen occasionally
Almost no jobs running (8% capacity used)
Need to kill the applications
During NodeManager’s resync with the ResourceManager?
NEED SERVICE-LEVEL AGREEMENT (SLA)
Define
Time for operations
Job duration
Request handling
Measure
Monitoring
Respect
Some services are “best effort”
OPERATOR ERROR
Same operators on both clusters
One Chef server for both clusters
Single mistake → both clusters
WE HAVE
2 prod clusters
2 pre-prod clusters
1 infrastructure cluster
2 running CDH4
3 running CDH5
2682 datanodes
49 248 cores
135 PB disk space
842 TB RAM
> 300 000 jobs/day
100 TB imported daily
6 PB created or read per day
UPCOMING CHALLENGES
Optimize and fix Hadoop
Add hundreds more datanodes
Create a new bare-metal data-centre
Make 2 big clusters work together
Improve scheduling
We are hiring
Come and join us in Paris, Palo Alto or Ann Arbor
s.pook@criteo.com @StuartPook
Questions?
