Big data solution capacity planning

The document outlines the capacity planning for a big data solution, detailing the setup of an analytical and alerting system for data from 10,000 servers generating approximately 10 million events or 50 GB of data daily. It specifies the architecture involving components like Hadoop, Spark, Kafka, Elasticsearch, and HDFS, with storage requirements calculated to approximately 6.15 TB monthly, accounting for growth rates of 10% quarterly and 15% annually. Hardware specifications for data and namenodes are provided, emphasizing the need for specific CPU, memory, and storage configurations.

Hello!
I am Riyaz A Shaikh
Full Stack Architect
You can find me at:
@jf @rizAShaikh
Riyaz A Shaikh
www.riyazshaikh.com

Requirement
Set up an analytical and alerting system for the data produced by 10,000 servers.
Assume that all servers together generate 10 million events per day, amounting to about 50 GB of data per day.
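As a rough sanity check on the requirement, the figures above can be turned into per-event and per-second averages. This is only a sketch: decimal units (1 GB = 10^9 bytes) and a perfectly even load across the day are assumptions, not something the slides state, and real traffic will have peaks.

```python
# Figures taken from the requirement above.
SERVERS = 10_000
EVENTS_PER_DAY = 10_000_000
DATA_PER_DAY_GB = 50

# Assumptions for illustration: decimal units and an even load over 24 hours.
BYTES_PER_GB = 10**9
SECONDS_PER_DAY = 24 * 60 * 60

avg_event_bytes = DATA_PER_DAY_GB * BYTES_PER_GB / EVENTS_PER_DAY
events_per_server_per_day = EVENTS_PER_DAY / SERVERS
avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
avg_ingest_mb_per_second = DATA_PER_DAY_GB * 1000 / SECONDS_PER_DAY

print(f"Average event size:        {avg_event_bytes:.0f} bytes")
print(f"Events per server per day: {events_per_server_per_day:.0f}")
print(f"Average events per second: {avg_events_per_second:.1f}")
print(f"Average ingest rate:       {avg_ingest_mb_per_second:.2f} MB/s")
```

These are averages only; sizing for peak load would need a burst factor that the requirement does not give.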

Big Data Cluster
Consider the Hortonworks Hadoop distribution for the cluster setup, with the following systems:
• HDFS for data backup in compressed format
• Spark for data computation and transformation
• Apache Kafka as the messaging service, for data completeness
• Flume for data capture
• Elasticsearch for analytical data storage and as the search engine
• Kibana for data visualization

Kafka cluster capacity
Assumption                           Size in GB   Rationale
Daily average raw data ingest rate   50
Kafka retention period of 2 days     100          Raw data * retention period
Kafka replication factor of 3        300          Raw data * retention period * replication factor
Storage per day                      300 GB
Storage per month                    Not needed   This is staging. A monthly calculation is not required because data is purged automatically after the retention period.
Table 1
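The arithmetic in Table 1 can be written as a small function. This is a minimal sketch; the function and parameter names are illustrative and not part of any Kafka API.

```python
def kafka_storage_gb(daily_ingest_gb: float, retention_days: int, replication_factor: int) -> float:
    """Steady-state Kafka disk footprint in GB (Table 1 arithmetic)."""
    # Data kept on disk at any time: one day's ingest for each retained day.
    retained_gb = daily_ingest_gb * retention_days
    # Every retained byte is stored once per replica.
    return retained_gb * replication_factor


kafka_gb = kafka_storage_gb(daily_ingest_gb=50, retention_days=2, replication_factor=3)
print(f"Kafka storage: {kafka_gb} GB")
```

Because old segments are purged after the retention period, this figure does not grow month over month at a constant ingest rate.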

Elasticsearch cluster capacity
Assumption                           Size in GB   Rationale             Remarks
Daily average raw data ingest rate   50
Elasticsearch 3 shards               50                                 Shards split the index. No extra space required.
Elasticsearch 3 replicas             150          Raw data * replicas   Each shard will be replicated 3 times
Storage per day                      150 GB
Storage per month                    4500 GB      Per day * 30          4.5 TB per month
Table 2
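The Table 2 figures can be reproduced with the sketch below. It assumes that "3 replica" means three copies of the data in total, which is what the 150 GB figure implies. In Elasticsearch's own index settings, `number_of_replicas` counts copies in addition to the primary, so the alternative reading (one primary plus three replicas) is shown for comparison. The function name is illustrative.

```python
def es_storage_gb(daily_ingest_gb: float, total_copies: int, days: int = 1) -> float:
    """Elasticsearch disk footprint in GB; sharding splits the index and adds no space."""
    return daily_ingest_gb * total_copies * days


# Reading used in Table 2: three copies of the data in total.
es_daily_gb = es_storage_gb(50, total_copies=3)
es_monthly_gb = es_storage_gb(50, total_copies=3, days=30)

# Alternative reading: number_of_replicas = 3, i.e. 1 primary + 3 replica copies.
es_daily_gb_alt = es_storage_gb(50, total_copies=1 + 3)

print(f"Per day:   {es_daily_gb} GB")
print(f"Per month: {es_monthly_gb} GB")
print(f"Per day with 1 primary + 3 replicas: {es_daily_gb_alt} GB")
```

The sketch ignores index overhead and any difference between raw and indexed size, as the table does.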

HDFS to back up Elasticsearch data
Assumption                           Size in GB   Rationale                        Remarks
Daily average raw data ingest rate   50
HDFS replication factor of 3         150          Raw data * replication factor
70% compression                      45           150 - (150 * 70 / 100)           LZO compression
Storage per day                      45 GB
Storage per month                    1350 GB                                       1.35 TB per month
Table 3
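A minimal sketch of the Table 3 arithmetic follows. The 70% figure is the slide's assumption for LZO; actual compression ratios depend on the data. The function name and parameters are illustrative.

```python
def hdfs_storage_gb(daily_ingest_gb: float, replication_factor: int,
                    compression_pct: int, days: int = 1) -> float:
    """HDFS backup footprint in GB after replication and compression."""
    replicated_gb = daily_ingest_gb * replication_factor
    # A compression of 70% leaves 30% of the replicated size on disk.
    compressed_gb = replicated_gb * (100 - compression_pct) / 100
    return compressed_gb * days


hdfs_daily_gb = hdfs_storage_gb(50, replication_factor=3, compression_pct=70)
hdfs_monthly_gb = hdfs_storage_gb(50, replication_factor=3, compression_pct=70, days=30)

print(f"Per day:   {hdfs_daily_gb} GB")
print(f"Per month: {hdfs_monthly_gb} GB")
```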

Typical node structure
Table 4
Node structure                                       Size   Remarks
Typical per data node storage capacity               4 TB   2 × 2 TB HDD
Temp space for processing by Spark, MapReduce etc.   1 TB   25% of the data node
Data node usable storage                             3 TB   Raw storage - Spark reserve
Considering the storage capacity from the three tables above (Table 1, Table 2 and Table 3),
the total storage required per month is
300 GB + 4500 GB + 1350 GB = 6150 GB (approx. 6.15 TB)
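The monthly total and the usable capacity per data node can be checked with a few lines. This is a sketch using the slide's figures; decimal units (1 TB = 1000 GB) are assumed, and the variable names are illustrative.

```python
# Storage figures from Tables 1, 2 and 3, in GB.
KAFKA_GB = 300
ES_MONTHLY_GB = 4500
HDFS_MONTHLY_GB = 1350

total_monthly_gb = KAFKA_GB + ES_MONTHLY_GB + HDFS_MONTHLY_GB
total_monthly_tb = total_monthly_gb / 1000  # assumes 1 TB = 1000 GB

# Data node layout from Table 4.
disks_per_node = 2
disk_size_tb = 2
temp_reserve_pct = 25  # share of raw storage set aside for Spark/MapReduce temp space

raw_tb = disks_per_node * disk_size_tb
temp_tb = raw_tb * temp_reserve_pct / 100
usable_tb = raw_tb - temp_tb

print(f"Total per month: {total_monthly_gb} GB ({total_monthly_tb} TB)")
print(f"Usable storage per data node: {usable_tb} TB")
```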

Assuming 10% data growth per quarter, and further considering
15% year-on-year growth in data volume,
Table 5 below indicates the capacity required as data grows
year-on-year.

Capacity growth year-on-year
Table 5
10% data growth quarterly (data in TB)
Quarter               Year 1   Year 2   Year 3   Year 4   Year 5
Q1                    6.15     9.4      12.5     16.7     22.2
Q2                    6.8      9.9      13.2     17.5     23.3
Q3                    7.4      10.4     13.8     18.4     24.5
Q4                    8.2      10.9     14.5     19.3     25.7
Yearly storage        28.5     40.6     54.0     71.9     95.7
Data nodes required   10       14       18       24       32
Data nodes required = yearly storage / data node usable storage, rounded up to a whole node
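The node counts in Table 5 follow from dividing each yearly total by the 3 TB of usable storage per data node and rounding up. The sketch below reproduces that row, and also regenerates the Year 1 column from the 6.15 TB starting point at 10% growth per quarter. The yearly totals for Years 2 to 5 are taken from the table as given rather than recomputed, since the slides do not spell out the within-year rate used for those columns. Function names are illustrative.

```python
import math

USABLE_TB_PER_NODE = 3  # from Table 4

# Yearly storage totals in TB, copied from Table 5.
YEARLY_STORAGE_TB = {
    "Year 1": 28.5,
    "Year 2": 40.6,
    "Year 3": 54.0,
    "Year 4": 71.9,
    "Year 5": 95.7,
}


def data_nodes_required(yearly_tb: float, usable_tb_per_node: float = USABLE_TB_PER_NODE) -> int:
    """Number of data nodes needed, rounded up to a whole node."""
    return math.ceil(yearly_tb / usable_tb_per_node)


def year_one_quarters(q1_tb: float, quarterly_growth_pct: int = 10) -> list:
    """Q1..Q4 storage for the first year, compounding growth each quarter."""
    quarters = [q1_tb]
    for _ in range(3):
        quarters.append(quarters[-1] * (100 + quarterly_growth_pct) / 100)
    return quarters


nodes = {year: data_nodes_required(tb) for year, tb in YEARLY_STORAGE_TB.items()}
year_one = year_one_quarters(6.15)

for year, count in nodes.items():
    print(f"{year}: {count} data nodes")
print("Year 1 quarters (TB):", [round(q, 2) for q in year_one])
```

The Year 2 Q1 figure in the table is consistent with 15% growth over Year 1 Q4 (8.2 × 1.15 ≈ 9.4).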

Hardware Specs
Considering one year of storage on ten data
nodes, with one NameNode and one standby
NameNode.
Tables 6 and 7 show the hardware configuration of
each machine.

Typical worker node hardware configurations
Table 6
Midline configuration (Data Node)
CPU                  2 × 8 core 2.9 GHz
Memory               64 GB DDR3-1600 ECC
Disk controller      SAS 6 Gb/s
Disks                5 × 1 TB LFF SATA II 7200 RPM (1 TB for the OS)
Network controller   2 × 1 Gb Ethernet
Notes
CPU features such as Intel's Hyper-Threading and QPI are desirable.
Allocate memory to take advantage of triple- or quad-channel
memory configurations.

Typical NameNode hardware configurations
Table 7
NameNode configuration
CPU                  2 × 8 core 2.9 GHz
Memory               128 GB
Disk controller      RAID 1
Disks                4 × 1 TB (1 TB for the OS, 2 TB for the FS image and 1 TB for the JournalNode)
Network controller   2 × 1 Gb Ethernet
Notes
CPU features such as Intel's Hyper-Threading and QPI are desirable.
Allocate memory to take advantage of triple- or quad-channel
memory configurations.

Thanks!
Any questions & feedback!
Write to me at:
@rizAShaikh
Shaikh.r.a@gmail.com
