Capacity Planning
Big Data Solution
Hello!
I am Riyaz A Shaikh
Full Stack Architect
You can find me at:
@jf @rizAShaikh
Riyaz A Shaikh
www.riyazshaikh.com
Requirement
Set up an analytics and alerting system for
data produced by 10,000 servers.
Assumptions: all servers together generate
10 million events per day, amounting to
about 50 GB of data per day.
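The stated assumptions imply a few per-server and per-second figures that are useful when sizing ingest. The sketch below derives them; the decimal unit convention (1 GB = 10^9 bytes) and all names are assumptions made for illustration, not figures from the slides.

```python
# Figures from the requirement slide.
SERVERS = 10_000
EVENTS_PER_DAY = 10_000_000
DATA_PER_DAY_GB = 50

# Assumption: decimal units (1 GB = 10**9 bytes).
BYTES_PER_GB = 10**9
SECONDS_PER_DAY = 24 * 60 * 60


def derived_rates(servers, events_per_day, data_per_day_gb):
    """Return averages implied by the daily totals (no peak factor applied)."""
    total_bytes = data_per_day_gb * BYTES_PER_GB
    return {
        "events_per_server_per_day": events_per_day / servers,
        "avg_event_size_bytes": total_bytes / events_per_day,
        "avg_events_per_second": events_per_day / SECONDS_PER_DAY,
        "avg_ingest_mb_per_second": total_bytes / SECONDS_PER_DAY / 10**6,
    }


rates = derived_rates(SERVERS, EVENTS_PER_DAY, DATA_PER_DAY_GB)
for name, value in rates.items():
    print(f"{name}: {value:,.2f}")
```

These are daily averages only; real traffic is usually bursty, so peak rates would need a separate estimate.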
Big Data Cluster
The cluster uses the Hortonworks Hadoop
distribution with the following components:
• HDFS for data backup in compressed format
• Spark for data computation and transformation
• Apache Kafka as the messaging service, for data completeness
• Flume for data capture
• Elasticsearch for analytical data storage and search
• Kibana for data visualization
Kafka cluster capacity
| Assumption | Size in GB | Rationale |
|---|---|---|
| Daily average raw data ingest rate | 50 | |
| Kafka retention period of 2 days | 100 | Raw data × retention period |
| Kafka replication factor of 3 | 300 | Raw data × retention period × replication factor |
| Storage per day | 300 | |
| Storage per month | n/a | This is staging. A monthly calculation is not required because data is auto-purged after the retention period. |

Table 1
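The Kafka figure is a steady-state footprint: once the retention window is full, old data is purged as new data arrives. A minimal sketch of the arithmetic in Table 1 follows; the function name is hypothetical, and overheads the slides do not mention (indexes, free-space headroom) are deliberately left out.

```python
def kafka_storage_gb(daily_ingest_gb, retention_days, replication_factor):
    """Disk held by Kafka once the retention window is full."""
    return daily_ingest_gb * retention_days * replication_factor


# Inputs from Table 1.
kafka_gb = kafka_storage_gb(daily_ingest_gb=50, retention_days=2, replication_factor=3)
print(kafka_gb)
```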
Elasticsearch cluster capacity
| Assumption | Size in GB | Rationale | Remarks |
|---|---|---|---|
| Daily average raw data ingest rate | 50 | | |
| Elasticsearch 3 shards | 50 | | Shards split the index. No extra space is required. |
| Elasticsearch 3 replicas | 150 | Raw data × replicas | Each shard will be replicated 3 times. |
| Storage per day | 150 | | |
| Storage per month | 4500 | Per day × 30 | 4.5 TB per month |

Table 2
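The Elasticsearch arithmetic can be sketched as below. Sharding does not change the total, so it does not appear as a parameter. In Elasticsearch's own index settings, `number_of_replicas` counts copies in addition to the primary; the sketch treats the slide's multiplier of 3 as the total number of copies, which is an assumption about the slide's intent. The function name is hypothetical.

```python
def elasticsearch_storage_gb(daily_ingest_gb, copies, days=1):
    """Storage for `days` of data when each shard is stored `copies` times."""
    return daily_ingest_gb * copies * days


# Inputs from Table 2; a 30-day month is the slide's convention.
es_per_day_gb = elasticsearch_storage_gb(daily_ingest_gb=50, copies=3)
es_per_month_gb = elasticsearch_storage_gb(daily_ingest_gb=50, copies=3, days=30)
print(es_per_day_gb, es_per_month_gb)
```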
HDFS to backup Elasticsearch data
| Assumption | Size in GB | Rationale | Remarks |
|---|---|---|---|
| Daily average raw data ingest rate | 50 | | |
| HDFS replication factor of 3 | 150 | Raw data × replication factor | |
| 70% compression | 45 | 150 − (150 × 70 / 100) | LZO compression |
| Storage per day | 45 | | |
| Storage per month | 1350 | | 1.35 TB per month |

Table 3
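The HDFS calculation applies replication first and then subtracts the space saved by compression, mirroring the formula in Table 3. A minimal sketch, with a hypothetical function name; the 70% saving is the slide's assumption and real LZO ratios depend on the data.

```python
def hdfs_storage_gb(daily_ingest_gb, replication_factor, compression_saving_pct, days=1):
    """Replicated size minus the share saved by compression."""
    replicated = daily_ingest_gb * replication_factor * days
    return replicated - (replicated * compression_saving_pct / 100)


# Inputs from Table 3; a 30-day month is the slide's convention.
hdfs_per_day_gb = hdfs_storage_gb(50, replication_factor=3, compression_saving_pct=70)
hdfs_per_month_gb = hdfs_storage_gb(50, replication_factor=3, compression_saving_pct=70, days=30)
print(hdfs_per_day_gb, hdfs_per_month_gb)
```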
Typical Node structure
Table 4
| Node structure | Size | Remarks |
|---|---|---|
| Typical per data node storage capacity | 4 TB | 2 × 2 TB HDD |
| Temp space for processing by Spark, MapReduce, etc. | 1 TB | 25% of the data node |
| Data node usable storage | 3 TB | Raw storage − Spark reserve |
Combining the storage figures from Table 1, Table 2 and Table 3,
the total storage required per month is
300 GB + 4500 GB + 1350 GB = 6150 GB (approx. 6.15 TB)
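The per-node usable storage and the monthly total can be reproduced with a few lines. This is a sketch of the slide arithmetic only; variable names are illustrative, and decimal units (1 TB = 1000 GB) are assumed because that is what the 6.15 TB figure implies.

```python
# Per-node figures from Table 4.
DISKS_PER_NODE = 2
DISK_SIZE_TB = 2
TEMP_SHARE = 0.25  # share reserved as temp space for Spark, MapReduce, etc.

node_raw_tb = DISKS_PER_NODE * DISK_SIZE_TB
node_temp_tb = node_raw_tb * TEMP_SHARE
node_usable_tb = node_raw_tb - node_temp_tb

# Storage figures from Tables 1 to 3, in GB.
monthly_gb = {"kafka": 300, "elasticsearch": 4500, "hdfs": 1350}
total_monthly_gb = sum(monthly_gb.values())
total_monthly_tb = total_monthly_gb / 1000  # assumption: 1 TB = 1000 GB

print(node_usable_tb, total_monthly_gb, total_monthly_tb)
```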
Assuming 10% data growth per quarter and, in addition,
15% year-on-year growth in data volume,
Table 5 below shows the capacity required as the data
grows year on year.
Capacity growth year-on-year
Table 5
10% data growth quarterly (data in TB)

| Quarter | Year 1 | Year 2 | Year 3 | Year 4 | Year 5 |
|---|---|---|---|---|---|
| Q1 | 6.15 | 9.4 | 12.5 | 16.7 | 22.2 |
| Q2 | 6.8 | 9.9 | 13.2 | 17.5 | 23.3 |
| Q3 | 7.4 | 10.4 | 13.8 | 18.4 | 24.5 |
| Q4 | 8.2 | 10.9 | 14.5 | 19.3 | 25.7 |
| Yearly storage | 28.5 | 40.6 | 54.0 | 71.9 | 95.7 |
| Data nodes required (yearly storage / data node usable storage) | 10 | 14 | 18 | 24 | 32 |
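The Year 1 column and the data-node row can be reproduced as follows. This is a minimal sketch with hypothetical function names: it compounds 10% per quarter from the 6.15 TB starting point for Year 1 only, and takes the yearly totals for all five years from Table 5 as given rather than recomputing them. Node counts are rounded up, since a partial node cannot be provisioned.

```python
import math


def quarterly_projection(start_tb, growth_rate, quarters=4):
    """Compound `growth_rate` per quarter, starting from `start_tb` in Q1."""
    values = []
    current = start_tb
    for _ in range(quarters):
        values.append(current)
        current *= 1 + growth_rate
    return values


def data_nodes_required(yearly_storage_tb, usable_per_node_tb=3):
    """Yearly storage divided by usable storage per node, rounded up."""
    return math.ceil(yearly_storage_tb / usable_per_node_tb)


year1 = quarterly_projection(start_tb=6.15, growth_rate=0.10)
print([round(v, 2) for v in year1], round(sum(year1), 1))

# Yearly storage row of Table 5, taken as given.
yearly_storage_tb = [28.5, 40.6, 54.0, 71.9, 95.7]
nodes = [data_nodes_required(tb) for tb in yearly_storage_tb]
print(nodes)
```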
Hardware Specs
Considering one year of storage on ten data
nodes, with one NameNode and one standby
NameNode.
Tables 6 and 7 show the hardware configuration
of each machine.
Typical worker node hardware configurations
Table 6
Midline configuration (Data Node)

| Component | Specification |
|---|---|
| CPU | 2 × 8 core 2.9 GHz |
| Memory | 64 GB DDR3-1600 ECC |
| Disk controller | SAS 6 Gb/s |
| Disks | 5 × 1 TB LFF SATA II 7200 RPM; 1 TB for the OS |
| Network controller | 2 × 1 Gb Ethernet |

Notes
CPU features such as Intel's Hyper-Threading and QPI are desirable.
Allocate memory to take advantage of triple- or quad-channel
memory configurations.
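As a sanity check, the ten-node cluster built from this data node configuration can be compared against the Year 1 requirement. The sketch assumes the 25% temp-space reserve from Table 4 applies to the four non-OS disks of each node; that pairing of Table 4 and Table 6 is an assumption, and the variable names are illustrative.

```python
# Cluster shape from the Hardware Specs slide and Table 6.
DATA_NODES = 10
DISKS_PER_NODE = 5
DISK_SIZE_TB = 1
OS_DISKS = 1

# Assumption: the 25% temp-space reserve from Table 4 applies to the data disks.
TEMP_SHARE = 0.25

# Year 1 requirement from Table 5.
YEAR1_STORAGE_TB = 28.5

data_tb_per_node = (DISKS_PER_NODE - OS_DISKS) * DISK_SIZE_TB
usable_tb_per_node = data_tb_per_node * (1 - TEMP_SHARE)
cluster_usable_tb = usable_tb_per_node * DATA_NODES
headroom_tb = cluster_usable_tb - YEAR1_STORAGE_TB

print(cluster_usable_tb, headroom_tb, cluster_usable_tb >= YEAR1_STORAGE_TB)
```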
Typical Namenode hardware configurations
Table 7
NameNode configuration

| Component | Specification |
|---|---|
| CPU | 2 × 8 core 2.9 GHz |
| Memory | 128 GB |
| Disk controller | RAID 1 |
| Disks | 4 × 1 TB: 1 TB for the OS, 2 TB for the FS image and 1 TB for the JournalNode |
| Network controller | 2 × 1 Gb Ethernet |

Notes
CPU features such as Intel's Hyper-Threading and QPI are desirable.
Allocate memory to take advantage of triple- or quad-channel
memory configurations.
Thanks!
Any questions or feedback?
Write to me at:
@rizAShaikh
Shaikh.r.a@gmail.com
