HDFS on Kubernetes—Lessons Learned with Kimoon Kim


This document outlines lessons learned from running HDFS on Kubernetes, covering key topics such as Kubernetes architecture, integration with big data tools like Spark, and the importance of data locality in enhancing performance. It details the challenges faced, including data locality issues and the solutions implemented to improve task execution efficiency. The document also encourages collaboration and further inquiries, providing links to relevant resources and career opportunities.


Kimoon Kim (kimoon@pepperdata.com)
HDFS on Kubernetes --
Lessons Learned

Outline
1. Kubernetes intro
2. Big Data on Kubernetes
3. Demo
4. Problems we fixed -- HDFS data locality

Kubernetes
New open-source cluster manager.
Runs programs in Linux containers.
1000+ contributors and 40,000+ commits.

“My app was running fine until someone installed their software”

More isolation is good
Kubernetes provides each program with:
• a lightweight virtual file system
– an independent set of S/W packages
• a virtual network interface
– a unique virtual IP address
– an entire range of ports
• etc

Kubernetes architecture
Diagram: node A (196.0.0.5) and node B (196.0.0.6) host Pod 1 (10.0.0.2), Pod 2 (10.0.0.3), and Pod 3 (10.0.1.2).
Pod, a unit of scheduling and isolation.
• runs a user program in a primary container
• holds isolation layers like a virtual IP in an infra container

Big Data on Kubernetes
github.com/apache-spark-on-k8s
• Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata, Red Hat, and growing.
• patching up Spark Driver and Executor code to work on Kubernetes.
Yesterday’s talk:
spark-summit.org/2017/events/apache-spark-on-kubernetes/

Spark on Kubernetes
Diagram: two clients each submit a job to a cluster of node A (196.0.0.5) and node B (196.0.0.6).
Job 1: Driver Pod (10.0.0.2), Executor Pod 1 (10.0.0.3), Executor Pod 2 (10.0.1.2)
Job 2: Driver Pod (10.0.0.4), Executor Pod 1 (10.0.0.5), Executor Pod 2 (10.0.1.3)

What about storage?
Spark often stores data on HDFS.
How can Spark on Kubernetes access HDFS data?

Hadoop Distributed File System
Namenode
• runs on a central cluster node.
• maintains file system metadata.
Datanodes
• on every cluster node.
• read and write file data on local disks.

HDFS on Kubernetes
Diagram: two clients each submit a job to a cluster of node A (196.0.0.5) and node B (196.0.0.6).
Job 1: Driver Pod (10.0.0.2), Executor Pod 1 (10.0.0.3), Executor Pod 2 (10.0.1.2)
Job 2: Driver Pod (10.0.0.4), Executor Pod 1 (10.0.0.5), Executor Pod 2 (10.0.1.3)
HDFS: Namenode Pod, Datanode Pod 1, Datanode Pod 2
hadoop.fs.defaultFS
hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
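
The namenode address above appears to follow the DNS pattern Kubernetes gives a pod behind a headless service (as with StatefulSets): pod name, service name, namespace, then `svc` and the cluster domain. A minimal Python sketch of how such a URI is assembled; the function name, parameter names, and defaults are illustrative assumptions, not something shown in the talk.

```python
def namenode_uri(pod: str, service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local", port: int = 8020) -> str:
    """Build an hdfs:// URI from the parts of a Kubernetes pod DNS name."""
    # <pod>.<service>.<namespace>.svc.<cluster domain>
    host = f"{pod}.{service}.{namespace}.svc.{cluster_domain}"
    return f"hdfs://{host}:{port}"


if __name__ == "__main__":
    # Pod and service names are the ones visible in the slide's URL.
    print(namenode_uri("hdfs-namenode-0", "hdfs-namenode"))
```

Because the name is tied to the pod's identity rather than its IP, clients can keep the same `defaultFS` value even if the namenode pod is rescheduled and gets a new virtual IP.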

Demo
1. Label cluster nodes
2. Stand up HDFS
3. Launch a Spark job
4. Check Spark job output

What about data locality?
• Read data from local disks when possible
• Remote reads can be slow if network is slow
• Implemented in HDFS daemons, integrated with app frameworks like Spark
Diagram: Client 1 (node A) and Client 2 (node B) read from Datanode 1 and Datanode 2; the local read path is marked FAST and the remote read path is marked SLOW.

HDFS data locality in YARN
The Spark driver sends each task to the executor that runs on the same node as the task's HDFS data.
YARN example (Job 1):
node A (196.0.0.5): Executor 1, Datanode 1 (/fileA)
node B (196.0.0.6): Executor 2, Datanode 2 (/fileB)
Driver → Executor 1: Read /fileA
Driver → Executor 2: Read /fileB
(/fileA → Datanode 1 → 196.0.0.5) == (Executor 1 → 196.0.0.5)
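
The matching in the YARN example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Spark's scheduler code: the two dictionaries are toy stand-ins for namenode metadata and the driver's executor registry, and `pick_executor` is a hypothetical helper. The IPs are the ones on the slide.

```python
from typing import Dict, List, Optional

# Toy stand-in for namenode metadata: file path -> IPs of datanodes holding it.
BLOCK_HOSTS: Dict[str, List[str]] = {
    "/fileA": ["196.0.0.5"],  # Datanode 1 on node A
    "/fileB": ["196.0.0.6"],  # Datanode 2 on node B
}

# Toy stand-in for the driver's executor registry: executor name -> host IP.
EXECUTOR_HOSTS: Dict[str, str] = {
    "Executor 1": "196.0.0.5",
    "Executor 2": "196.0.0.6",
}


def pick_executor(path: str,
                  block_hosts: Dict[str, List[str]],
                  executor_hosts: Dict[str, str]) -> Optional[str]:
    """Return an executor whose host also stores `path`, or None if none is local."""
    data_hosts = set(block_hosts.get(path, []))
    for name, host in executor_hosts.items():
        if host in data_hosts:
            return name
    return None


if __name__ == "__main__":
    for path in ("/fileA", "/fileB"):
        print(path, "->", pick_executor(path, BLOCK_HOSTS, EXECUTOR_HOSTS))
```

The comparison only works because executors and datanodes report addresses from the same address space, which is exactly what changes once executors run in pods.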

Hmm, how do I find the right executors in Kubernetes...
Kubernetes example:
node A (196.0.0.5): Datanode Pod 1 (/fileA)
node B (196.0.0.6): Datanode Pod 2 (/fileB)
Driver Pod (10.0.0.2), Executor Pod 1 (10.0.0.3), Executor Pod 2 (10.0.1.2)
Tasks: Read /fileB, Read /fileA
(/fileA → Datanode 1 → 196.0.0.5) != (Executor 1 → 10.0.0.3)
The driver sees each executor's pod IP, which never equals a datanode's node IP.

Fix Spark Driver code
1. Ask the Kubernetes master to find the cluster node where the executor pod is running.
2. Get the node IP.
3. Compare with the datanode IPs.
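
The three steps above can be sketched in Python as follows. This is not the actual Spark driver patch: `POD_TO_NODE_IP` is a hypothetical in-memory stand-in for the answer the Kubernetes master would give, and `lookup_node_ip` and `is_local` are illustrative names. The mapping for Executor Pod 1 comes from the slides; placing Executor Pod 2 on node B is an assumption.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical stand-in for the Kubernetes master:
# executor pod IP -> IP of the cluster node running that pod.
POD_TO_NODE_IP: Dict[str, str] = {
    "10.0.0.3": "196.0.0.5",  # Executor Pod 1 on node A (from the slides)
    "10.0.1.2": "196.0.0.6",  # Executor Pod 2, assumed to run on node B
}


def lookup_node_ip(pod_ip: str) -> Optional[str]:
    """Steps 1 and 2: find the node running the pod and return that node's IP."""
    return POD_TO_NODE_IP.get(pod_ip)


def is_local(executor_pod_ip: str,
             datanode_ips: Set[str],
             resolve: Callable[[str], Optional[str]] = lookup_node_ip) -> bool:
    """Step 3: compare the executor's node IP with the datanode IPs."""
    node_ip = resolve(executor_pod_ip)
    return node_ip is not None and node_ip in datanode_ips


if __name__ == "__main__":
    file_a_datanodes = {"196.0.0.5"}  # /fileA is stored by Datanode 1
    # Comparing the pod IP directly never matches a datanode IP.
    print("naive match:", "10.0.0.3" in file_a_datanodes)
    # Resolving the pod IP to its node IP first restores the match.
    print("resolved match:", is_local("10.0.0.3", file_a_datanodes))
```

Passing the resolver as a parameter keeps the comparison logic separate from the lookup, so a real cluster query could replace the dictionary without touching `is_local`.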

Rescued data locality
node A (196.0.0.5): Datanode Pod 1 (/fileA)
node B (196.0.0.6): Datanode Pod 2 (/fileB)
Driver (10.0.0.2), Executor Pod 1 (10.0.0.3), Executor Pod 2 (10.0.1.2)
Tasks: Read /fileA, Read /fileB
(/fileA → Datanode 1 → 196.0.0.5) == (Executor 1 → 10.0.0.3 → 196.0.0.5)

Rescued data locality!
with data locality fix
- duration: 10 minutes
without data locality fix
- duration: 25 minutes

Recap
Got HDFS up and running.
Basic data locality works.
Open problems:
• Remaining data locality issues -- rack locality, node preference, etc
• Kerberos support
• Namenode High Availability

Join us!
● github.com/apache-spark-on-k8s
● pepperdata.com/careers/
More questions?
● Come to Pepperdata booth #101
● Mail kimoon@pepperdata.com
