Apache Spark - Scenario based questions

Let's say you have a 20 node Spark cluster.
Each node is of size - 16 CPU cores / 64 GB RAM.
Let's say each node has 3 executors, with each executor of size - 5 CPU cores / 21 GB RAM.

=> 1. What's the total capacity of the cluster?

We will have 20 * 3 = 60 executors.
Total CPU capacity: 60 * 5 = 300 CPU cores
Total memory capacity: 60 * 21 = 1260 GB RAM

=> 2. How many parallel tasks can run on this cluster?

We have 300 CPU cores, so we can run 300 parallel tasks on this cluster.

=> 3. Let's say you requested only 4 executors - then how many parallel tasks can run?

The capacity we get is 4 * 5 = 20 CPU cores and 4 * 21 = 84 GB RAM, so a total of 20 parallel tasks can run.

=> 4. Let's say we read a 10.1 GB CSV file stored in the data lake and have to do some filtering of the data - how many tasks will run?

If we create a dataframe out of the 10.1 GB file we will get 81 partitions (I will cover how the number of partitions is decided in my next post). So we have 81 partitions of 128 MB each; the last partition will be a bit smaller, around 100 MB. Our job will therefore have 81 tasks in total, but we only have 20 CPU cores.

Let's say each task takes around 10 seconds to process 128 MB of data. The first 20 tasks run in parallel; once those are done, the next 20 tasks are executed, and so on. So we get 5 cycles in total, in the most ideal scenario:

10 sec + 10 sec + 10 sec + 10 sec + 8 sec

The first 4 cycles process 80 tasks of 128 MB each. The last 8 seconds go to just one task of around 100 MB, so it finishes a little faster, but 19 CPU cores sit idle during that time.

=> 5. Is there a possibility of an out-of-memory error in the above scenario?

Each executor has 5 CPU cores and 21 GB RAM. This 21 GB RAM is divided into several parts:
- 300 MB reserved memory
- 40% user memory, to store user-defined variables/data structures (for example, a hashmap)
- 60% Spark unified memory, which is split 50:50 between storage memory and execution memory

So what we really care about here is execution memory, which comes to roughly 28% of the total memory allotted - consider around 6 GB of the 21 GB as execution memory. Per CPU core we have 6 GB / 5 cores = 1.2 GB of execution memory. That means each task can roughly handle around 1.2 GB of data; since we are handling only 128 MB per task, we are well within this range. (A small Python sketch of all these calculations is included at the end of this post.)

I hope you liked the explanation :)
Do mention in the comments what you want me to cover in my next post!

PS~ My new Data Engineering batch is starting this coming Saturday. DM to know more.

#bigdata #dataengineering #apachespark
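For anyone who wants to plug in their own cluster numbers, here is a minimal back-of-the-envelope sketch of the calculations above in plain Python. It assumes the default ~128 MB partition size and Spark's default memory fractions (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5); the variable names are only illustrative, not actual Spark config keys.

```python
import math

# --- Cluster setup from the post ---
nodes = 20
executors_per_node = 3
cores_per_executor = 5        # CPU cores per executor
mem_per_executor_gb = 21      # RAM per executor

# 1. Total cluster capacity
total_executors = nodes * executors_per_node              # 60
total_cores = total_executors * cores_per_executor        # 300
total_mem_gb = total_executors * mem_per_executor_gb      # 1260

# 2. Max parallel tasks = one task per CPU core
max_parallel_tasks = total_cores                           # 300

# 3. Capacity if we request only 4 executors
requested_executors = 4
requested_cores = requested_executors * cores_per_executor     # 20
requested_mem_gb = requested_executors * mem_per_executor_gb   # 84

# 4. Tasks and cycles for a 10.1 GB file split into ~128 MB partitions
file_size_mb = 10.1 * 1024
partition_size_mb = 128
num_partitions = math.ceil(file_size_mb / partition_size_mb)   # 81
num_cycles = math.ceil(num_partitions / requested_cores)       # 5

# 5. Rough execution memory per core for one executor:
#    subtract 300 MB reserved, take 60% unified memory, half of it for execution
reserved_mb = 300
usable_mb = mem_per_executor_gb * 1024 - reserved_mb
execution_mb = usable_mb * 0.6 * 0.5                       # ~6.2 GB
execution_per_core_mb = execution_mb / cores_per_executor  # ~1.2 GB

print(total_executors, total_cores, total_mem_gb)          # 60 300 1260
print(requested_cores, requested_mem_gb)                   # 20 84
print(num_partitions, num_cycles)                          # 81 5
print(round(execution_per_core_mb / 1024, 2), "GB per core")
```

Running it reproduces the figures in the post: 60 executors, 300 cores, 1260 GB total; 81 partitions processed in 5 cycles on 20 cores; and roughly 1.2 GB of execution memory per core.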
Good one, just thinking: for a 10 GB CSV, can we really split it into clean 128 MB partitions? Parsing or compression could change that, right? 🤔
Very helpful
Thanks for sharing the information, very helpful.
This was a really good explanation, simple and straight to the point. Loved how you broke down the cluster capacity, parallelism, and memory calculation step by step. Definitely looking forward to your next post on how partitions are decided; that's something a lot of people struggle with. Thanks for sharing and making these concepts so easy to grasp!
Interesting