Apache Spark cluster capacity and task running scenarios

Sumit Mittal

Founder @ TrendyTech.in | Data Engineering & GenAI Mentor | 400K+ Followers | Shaping the Next Generation of Data Engineers

Apache Spark - scenario based question

Let's say you have a 20 node Spark cluster. Each node is of size 16 CPU cores / 64 GB RAM, and each node runs 3 executors, with each executor of size 5 CPU cores / 21 GB RAM.

=> 1. What's the total capacity of the cluster?

We will have 20 * 3 = 60 executors.
Total CPU capacity: 60 * 5 = 300 CPU cores.
Total memory capacity: 60 * 21 = 1260 GB RAM.

=> 2. How many parallel tasks can run on this cluster?

We have 300 CPU cores, so we can run 300 parallel tasks on this cluster.

=> 3. Let's say you requested 4 executors. How many parallel tasks can run now?

The capacity we got is 20 CPU cores / 84 GB RAM, so a total of 20 parallel tasks can run.

=> 4. Let's say we read a 10.1 GB CSV file stored in a data lake and have to do some filtering of the data. How many tasks will run?

If we create a dataframe out of the 10.1 GB file, we will get 81 partitions in the dataframe (I will cover how the number of partitions is decided in my next post). So we have 81 partitions, each of size 128 MB, with the last partition a bit smaller, and our job will have 81 total tasks. But we have only 20 CPU cores.

Let's say each task takes around 10 seconds to process 128 MB of data. The first 20 tasks run in parallel; once they are done, the next 20 tasks are executed, and so on. In the most ideal scenario that is 5 cycles: 10 sec + 10 sec + 10 sec + 10 sec + 8 sec. The first 4 cycles process 80 tasks of 128 MB each; the last 8 sec is for the one remaining task of around 100 MB, so it finishes a little sooner, but 19 CPU cores sit idle during that time.

=> 5. Is there a possibility of an out of memory error in the above scenario?

Each executor has 5 CPU cores and 21 GB RAM. This 21 GB RAM is divided into various parts:
- 300 MB reserved memory
- 40% user memory, to store user defined variables/data (for example a hashmap)
- 60% Spark memory, which is divided 50:50 between storage memory and execution memory

So we are really looking at the execution memory, which comes to roughly 28% of the total memory allotted, i.e. around 6 GB of the 21 GB. Per CPU core that is 6 GB / 5 cores = 1.2 GB of execution memory, which means each task can roughly handle around 1.2 GB of data. Since each task here handles only 128 MB, we are well under this range.

I hope you liked the explanation :)

Do mention in a comment what you want me to bring in my next post!

PS~ My new Data Engineering batch is starting this coming Saturday. DM to know more.

#bigdata #dataengineering #apachespark
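A quick sketch of the arithmetic above, written as plain Python so the numbers can be re-derived. The cluster sizes come from the scenario in the post; the 128 MB split size, 300 MB reserved memory, spark.memory.fraction = 0.6 and the 50:50 storage/execution split are Spark's documented defaults, so treat the result as a rough estimate rather than an exact per-cluster figure:

```python
import math

# Cluster layout from the scenario: 20 nodes, 3 executors per node,
# each executor with 5 CPU cores and 21 GB RAM.
nodes, executors_per_node = 20, 3
cores_per_executor, mem_per_executor_gb = 5, 21

executors = nodes * executors_per_node            # 60 executors
total_cores = executors * cores_per_executor      # 300 cores -> 300 parallel tasks
total_mem_gb = executors * mem_per_executor_gb    # 1260 GB RAM

# Scenario 4: 10.1 GB CSV read with the default 128 MB partition size.
file_size_mb = 10.1 * 1024
partition_size_mb = 128                           # spark.sql.files.maxPartitionBytes default
num_partitions = math.ceil(file_size_mb / partition_size_mb)   # 81 partitions -> 81 tasks

# With only 4 executors requested (20 cores), tasks run in waves of 20.
requested_cores = 4 * cores_per_executor
waves = math.ceil(num_partitions / requested_cores)            # 5 waves

# Scenario 5: rough unified-memory breakdown per executor, using Spark defaults.
reserved_mb = 300
usable_mb = mem_per_executor_gb * 1024 - reserved_mb
spark_memory_mb = usable_mb * 0.6                 # spark.memory.fraction = 0.6
execution_memory_mb = spark_memory_mb * 0.5       # 50:50 storage / execution split
per_core_mb = execution_memory_mb / cores_per_executor         # ~1.2 GB per task slot

print(executors, total_cores, total_mem_gb)                    # 60 300 1260
print(num_partitions, waves)                                   # 81 5
print(round(execution_memory_mb / 1024, 1), round(per_core_mb / 1024, 2))
```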

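And a minimal PySpark sketch of scenario 3, requesting 4 executors of 5 cores / 21 GB each and then inspecting how many partitions the CSV read produced. It assumes a cluster manager where spark.executor.instances is honored and dynamic allocation is disabled; the data lake path, column name, and filter value are hypothetical placeholders, not from the post:

```python
from pyspark.sql import SparkSession

# Request 4 executors of 5 cores / 21 GB each (scenario 3).
# Assumes dynamic allocation is off so spark.executor.instances is respected.
spark = (
    SparkSession.builder
    .appName("capacity-demo")
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "21g")
    .config("spark.dynamicAllocation.enabled", "false")
    .getOrCreate()
)

# Reading a ~10.1 GB CSV; with the default 128 MB split size this typically
# yields ~81 input partitions, i.e. ~81 tasks for the scan + filter stage.
df = spark.read.option("header", "true").csv("s3://my-datalake/events/big_file.csv")  # placeholder path
filtered = df.filter(df["status"] == "ACTIVE")   # hypothetical column and filter

print(df.rdd.getNumPartitions())                 # inspect the actual partition count
```

One caveat: the clean 128 MB split only applies to splittable input. An uncompressed CSV (or one with a splittable codec) can be chunked this way, but a gzipped file is not splittable, so it would come in as a single partition regardless of size.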
Aditya Ranjan

Data Engineer | Full Stack Developer | Angular, Java, Python | Passionate About Scalable Web Apps & Data Science

6mo

Interesting

good one, just thinking - for a 10 gig CSV, can we really split into clean 128 MB partitions? parsing or compression could change it, right? 🤔

Ninad Lambat

Senior Software Engineer @ NiCE || Python || PLSQL || Pyspark || Postman || Jira || Power BI || Databricks || Snowflake || AWS EC2 || Dynamo DB || Microsoft certified Azure enterprise analyst

6mo

Very helpful

Venkata Satya Naresh Chundi

Senior Data Engineer | Python, PySpark, AWS, Hadoop | Big Data & Cloud Solutions | Banking, Credit Risk & Cryptocurrency Systems

6mo

Thanks for sharing the information, very helpful.

Malik Quazi

Big Data Engineer at Oracle | Hadoop & Databricks Specialist | Oracle Cloud (OCI) Certified | Azure Synapse Spark | Microsoft Fabric | Cloud & Data Solutions | Team Leadership & Client-Facing Experience

6mo

This was a really good explanation, simple and straight to the point. Loved how you broke down the cluster capacity, parallelism, and memory calculation step by step. Definitely looking forward to your next post on how partitions are decided, that's something a lot of people struggle with. Thanks for sharing and making these concepts so easy to grasp!
