Kimoon Kim (kimoon@pepperdata.com)
HDFS on Kubernetes --
Lessons Learned
Outline
1. Kubernetes intro
2. Big Data on Kubernetes
3. Demo
4. Problems we fixed -- HDFS data locality
Kubernetes
New open-source cluster manager.
Runs programs in Linux containers.
1000+ contributors and 40,000+ commits.
“My app was running fine
until someone installed
their software”
More isolation is good
Kubernetes provides each program with:
• a lightweight virtual file system
– an independent set of S/W packages
• a virtual network interface
– a unique virtual IP address
– an entire range of ports
• etc
Kubernetes architecture
Diagram: node A (196.0.0.5) and node B (196.0.0.6) run Pod 1, Pod 2, and Pod 3, which have virtual IPs 10.0.0.2, 10.0.0.3, and 10.0.1.2.
Pod, a unit of scheduling and isolation.
• runs a user program in a primary container
• holds isolation layers like a virtual IP in an infra container
Big Data on Kubernetes
github.com/apache-spark-on-k8s
• Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata, Red Hat,
and growing.
• patching up Spark Driver and Executor code to work on
Kubernetes.
Yesterday’s talk:
spark-summit.org/2017/events/apache-spark-on-kubernetes/
Spark on Kubernetes
Diagram: two clients launch Job 1 and Job 2 on node A (196.0.0.5) and node B (196.0.0.6). Each job has its own Driver Pod, Executor Pod 1, and Executor Pod 2, with virtual IPs 10.0.0.2, 10.0.0.3, 10.0.1.2 for one job and 10.0.0.4, 10.0.0.5, 10.0.1.3 for the other.
What about storage?
Spark often stores data on HDFS.
How can Spark on Kubernetes access HDFS data?
Hadoop Distributed File System
Namenode
• runs on a central cluster node.
• maintains file system metadata.
Datanodes
• on every cluster node.
• read and write file data on local disks.
HDFS on Kubernetes
Diagram: two clients launch Job 1 and Job 2 on node A (196.0.0.5) and node B (196.0.0.6). Each job has its own Driver Pod, Executor Pod 1, and Executor Pod 2, with virtual IPs 10.0.0.2, 10.0.0.3, 10.0.1.2 for one job and 10.0.0.4, 10.0.0.5, 10.0.1.3 for the other. HDFS runs on the same nodes as a Namenode Pod, Datanode Pod 1, and Datanode Pod 2.
hadoop.fs.defaultFS
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
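The namenode address above appears to follow the Kubernetes DNS pattern for a pod behind a headless service: pod name, service name, namespace, then the cluster domain. A minimal Python sketch, assuming that pattern; the function name and its defaults are illustrative and read off the address on the slide, not taken from the talk's code:

```python
def namenode_uri(pod="hdfs-namenode-0", service="hdfs-namenode",
                 namespace="default", cluster_domain="cluster.local",
                 port=8020):
    """Build an hdfs:// URI from the parts of a pod's stable DNS name.

    Hypothetical helper: the defaults mirror the address on the slide.
    """
    host = f"{pod}.{service}.{namespace}.svc.{cluster_domain}"
    return f"hdfs://{host}:{port}"


# Prints hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
print(namenode_uri())
```

Spark pods can then reach the namenode by this name regardless of which cluster node the Namenode Pod lands on.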
Demo
1. Label cluster nodes
2. Stand up HDFS
3. Launch a Spark job
4. Check Spark job output
What about data locality?
• Read data from local disks when possible
• Remote reads can be slow if the network is slow
• Implemented in HDFS daemons, integrated
with app frameworks like Spark
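Diagram: Client 1 on node A and Client 2 on node B read from Datanode 1 and Datanode 2. A read from the datanode on the client's own node is FAST; a read across the network is SLOW.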
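The local-first read preference can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the function name is hypothetical, hosts are plain IP strings borrowed from the slides, and it ignores network topology such as racks, which HDFS also takes into account.

```python
def choose_replica(client_host, replica_hosts):
    """Return (host, is_local) for the replica the client should read.

    Prefers a replica on the client's own host; otherwise falls back to
    the first remote replica in the list.
    """
    if not replica_hosts:
        raise ValueError("block has no replicas")
    if client_host in replica_hosts:
        return client_host, True
    return replica_hosts[0], False


# The only replica is on node A (196.0.0.5).
print(choose_replica("196.0.0.5", ["196.0.0.5"]))  # client on node A: ('196.0.0.5', True)
print(choose_replica("196.0.0.6", ["196.0.0.5"]))  # client on node B: ('196.0.0.5', False)
```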
HDFS data locality in YARN
The Spark driver sends each task to the executor on the node that holds the task's HDFS data.
Diagram (YARN example): for Job 1, the Driver sends "Read /fileA" to Executor 1 and "Read /fileB" to Executor 2. In HDFS, /fileA is on Datanode 1 on node A (196.0.0.5) and /fileB is on Datanode 2 on node B (196.0.0.6).
(/fileA → Datanode 1 → 196.0.0.5) == (Executor 1 → 196.0.0.5)
Hmm, how do I find the right executors in Kubernetes...
(/fileA → Datanode 1 → 196.0.0.5) != (Executor 1 → 10.0.0.3)
Diagram (Kubernetes example): Driver Pod (10.0.0.2), Executor Pod 1 (10.0.0.3), and Executor Pod 2 (10.0.1.2), with tasks "Read /fileA" and "Read /fileB". /fileA and /fileB are stored on Datanode Pod 1 on node A (196.0.0.5) and Datanode Pod 2 on node B (196.0.0.6).
Fix Spark Driver code
1. Ask Kubernetes master to find the cluster node where
the executor pod is running.
2. Get the node IP.
3. Compare with the datanode IPs.
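The three steps can be sketched in Python as follows. This is not the actual Spark driver patch: the function names are hypothetical, and a plain dict stands in for the Kubernetes master so the sketch runs without a cluster. In a real cluster, the pod's status.hostIP field in the Kubernetes API carries the node IP. The mapping of 10.0.1.2 to node B is assumed from the diagrams.

```python
def node_ip_of(pod_ip, pod_to_node):
    """Steps 1-2: find the cluster node hosting a pod and return its IP.

    `pod_to_node` is a hypothetical stand-in for the answer the
    Kubernetes master would give.
    """
    return pod_to_node[pod_ip]


def is_local(executor_pod_ip, datanode_ips, pod_to_node):
    """Step 3: compare the executor's node IP with the datanode IPs."""
    return node_ip_of(executor_pod_ip, pod_to_node) in datanode_ips


# Addresses from the slides: executor pods by virtual IP, nodes by host IP.
pod_to_node = {"10.0.0.3": "196.0.0.5", "10.0.1.2": "196.0.0.6"}
file_a_datanodes = {"196.0.0.5"}  # /fileA is on Datanode 1 (node A)

print("10.0.0.3" in file_a_datanodes)                        # naive pod-IP check: False
print(is_local("10.0.0.3", file_a_datanodes, pod_to_node))   # Executor Pod 1: True
print(is_local("10.0.1.2", file_a_datanodes, pod_to_node))   # Executor Pod 2: False
```

The first check shows why locality was lost: a pod's virtual IP never equals a datanode's host IP, so the comparison only succeeds after the pod is mapped to its node.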
Rescued data locality
Diagram: the Driver (10.0.0.2) sends "Read /fileA" to Executor Pod 1 (10.0.0.3) and "Read /fileB" to Executor Pod 2 (10.0.1.2). /fileA is on Datanode Pod 1 on node A (196.0.0.5) and /fileB is on Datanode Pod 2 on node B (196.0.0.6).
(/fileA → Datanode 1 → 196.0.0.5) == (Executor 1 → 10.0.0.3 → 196.0.0.5)
Rescued data locality!
with data locality fix
- duration: 10 minutes
without data locality fix
- duration: 25 minutes
Recap
Got HDFS up and running.
Basic data locality works.
Open problems:
• Remaining data locality issues -- rack locality, node preference, etc
• Kerberos support
• Namenode High Availability
Join us!
● github.com/apache-spark-on-k8s
● pepperdata.com/careers/
More questions?
● Come to Pepperdata booth #101
● Mail kimoon@pepperdata.com
