What Gartner says about big data
• Data Volume - size in GB, TB, PB, and ZB
• Data Variety - type; structured/unstructured
• Data Velocity - speed of storing & retrieving data
• Data Complexity - how complex the data is
• Data used for business analytics (OLAP)
• Open-source big data framework (Hadoop derives from Google's GFS and MapReduce papers)
• Built in the Sun/Oracle Java language
• Distributed/parallel computing platform
• Effective failover handling
• Runs on commodity hardware to handle big data
• Simple programming model: MapReduce (v1/v2), Hadoop Streaming, etc. (see the sketch after this list)
• Reuses the old, proven approach - a file system - applied with a distributed-storage principle
• Block-structured file system; isolated processes
• Batch-oriented job execution
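The MapReduce/Hadoop Streaming model above is easy to see in code. Below is a minimal word-count sketch for Hadoop Streaming in Python; Streaming pipes records through stdin/stdout, so the mapper and reducer are plain filters. Script names are illustrative.

#!/usr/bin/env python
# mapper.py - minimal Hadoop Streaming word-count mapper (illustrative sketch)
# Streaming feeds raw input lines on stdin; emit "word<TAB>1" for each word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - minimal Hadoop Streaming word-count reducer (illustrative sketch)
# The framework sorts mapper output by key, so equal words arrive adjacent.
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, n = line.rstrip("\n").partition("\t")
    if word != current and current is not None:
        print("%s\t%d" % (current, count))
        count = 0
    current = word
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))

A streaming job is then submitted with the hadoop-streaming jar, passing these scripts via -mapper and -reducer along with cluster-specific -input/-output HDFS paths.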
• What is the purpose of Hadoop?
Hadoop is built with Sun/Oracle Java technology to process big data,
measured in GB, TB, and beyond, on commodity hardware
• Apache designed the system for parallel processing around the four resources below:
1. CPU (Central Processing Unit)
2. RAM (Random Access Memory)
3. IO (Input & Output operations)
4. Network Bandwidth (speed of data transfer over the network)
Advantages:
Low cost of ownership
Highly scalable distributed storage and processing platform
Fast, reliable
Self-healing mechanism
Big Data Architecture by Narayana Basetty
• Multiple input data sources: flat files, XML files, RDBMS
• Source applications: SCM app, Finance app, Banking app, and many other apps
• Ingestion: Sqoop (RDBMS), Flume/Kafka (streams)
• Processing: Spark, MapReduce, Python
• Consumption/UI: J2EE UI, Tableau, QlikView, SAP WebI, and many other UX apps
Data Process Layers
Data Lake → Discover, Prepare & Integrate Biz Data → Final Reports Data
• RDBMS data flow: cold data SQL (Sqoop) → data discovery and staging (Hive) → final de-normalized/report data
• Stream data flow: hot data (Flume/Kafka) → data discovery and staging (Spark Streaming) → final de-normalized/report data
• Flat/XML file flow: file dump → discovery and processing (Hive external tables/parsers) → final de-normalized/report data
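As an illustration of the hot-data path above (Kafka → Spark Streaming → staged data), here is a minimal PySpark sketch. It assumes the receiver-based spark-streaming-kafka package of HDP-era Spark 1.x/2.x; the ZooKeeper quorum, topic, group, and HDFS paths are placeholders.

#!/usr/bin/env python
# Minimal sketch of the hot-data flow: Kafka -> Spark Streaming -> HDFS stage area.
# Host, topic, group, and path names below are illustrative placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="HotDataStage")
ssc = StreamingContext(sc, 60)  # 60-second micro-batches

# Receiver-based Kafka stream (spark-streaming-kafka package).
events = KafkaUtils.createStream(ssc, "zk01:2181", "stage-group", {"clickstream": 1})

# Keep only the message payload and land it in the stage area
# for later Hive/Spark discovery.
events.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///data/stage/clickstream/batch")

ssc.start()
ssc.awaitTermination()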
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Cluster Infra – Capacity Planning Sample UseCase

Volume and storage:
• Daily ingest rate: 500 GB
• Replication factor: 3 (copies of each HDFS block)
• Daily raw data volume (after replication): 1.5 TB = ingest × replication
• Node total available storage: 20 TB = 10 × 2 TB SSD (SATA II)
• OS system reserved space: 10 GB
• MapReduce temp storage: 25% (staging area for MapRed/Hive job data)
• Node raw usable storage: 15 TB = node raw storage - MapReduce reserve
• 1 year (flat growth): 37 nodes = (ingest × replication × 365) / node raw usable storage

System memory:
• Name Node: block count = TotalClusterStorage-MB / BlockSize-MB; roughly 1 GB of NN heap per 1 million blocks (about 150 bytes per block), plus RAM for other processes. (Note: MapR has no Name Node.)
• Data Node memory range: IO-bound 2-4 GB per core, CPU-bound 4-8 GB per core
• Data Node, IO-bound = 4 GB × # of physical cores + 2 GB DataNode process + 4 GB Node Manager + 4 GB OS
• Data Node, CPU-bound = 8 GB × # of physical cores + 2 GB DataNode process + 4 GB Node Manager + 4 GB OS
• Resource Manager memory = 4 GB × # of physical cores + RM process + 4 GB OS
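A worked version of the sizing arithmetic above, using the sample figures from the table (a sketch; the HDFS block size is an assumed input, so adjust the values to your cluster):

#!/usr/bin/env python
# Worked example of the capacity math above, using the table's sample figures.
import math

daily_ingest_gb = 500   # daily ingest rate
replication = 3         # copies of each HDFS block
node_usable_tb = 15     # raw usable storage per data node (after MapRed reserve)

# Daily raw volume after replication: 500 GB x 3 = 1.5 TB/day
daily_raw_tb = daily_ingest_gb * replication / 1000.0
print("Daily raw volume: %.1f TB" % daily_raw_tb)

# 1 year, flat growth: (1.5 TB x 365) / 15 TB per node ~= 37 nodes
nodes = math.ceil(daily_raw_tb * 365 / node_usable_tb)
print("Nodes needed for 1 year: %d" % nodes)

# Name Node heap: ~1 GB per million blocks (~150 bytes per block object).
block_size_mb = 128  # assumed HDFS block size
cluster_storage_mb = nodes * node_usable_tb * 1024 * 1024
blocks = cluster_storage_mb / block_size_mb
print("Estimated blocks: %.0f -> NN heap ~%.1f GB" % (blocks, blocks / 1e6))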
Hadoop Cluster Infra – Capacity Planning Sample UseCase

Description | High Availability | RAM, Cores, Processors
Name Node(s) | 3 compute nodes, 3 ZooKeeper | 1 TB, 8, 8
Resource Manager node(s) | 3 compute nodes, 3 ZooKeeper | 20 GB, 8, 8
Data Node | 100 compute nodes (100 × 10 slots × 2 TB) | 62 GB (depends on workload), 8, 8
Job History Server | 2 compute nodes | 2 TB, 8, 8
OS system reserved space | 25 GB - 50 GB |
MapReduce temp storage | 20% (staging area for MapRed/Hive job data) |
Edge nodes (user) | 2, 4, or 6 compute nodes (depends on users) | 64 GB, shared disks, 8, 8
Ambari | 1 compute node | 15 GB, 8, 8
SQL server (MySQL) | 3 (in replication), 500 GB disk | 16 GB, 8, 8
Rack-1 | Rack-2
hddev-c01-r01-03 (NN) | hddev-c01-r02-08 (NN)
hddev-c01-r01-04 (RM) | hddev-c01-r02-09 (RM)
hddev-c01-r01-05 (Hive, Storm) | hddev-c01-r02-10 (Hive, Storm)
hddev-c01-r01-06 (Kafka) | hddev-c01-r02-11 (Kafka)
hddev-c01-r01-07 (HBase) | hddev-c01-r02-12

Hadoop Cluster - ACL - Hadoop Admin
Load Balancer: hddev-c01-edge-01, hddev-c01-edge-02

Current infra: 1. 4 CPU cores, 24 GB RAM; 2. Edge - shared disks; 3. Hadoop disk - SSD
Proposed infra: 1. 8 CPU cores, 62 GB RAM; 2. Edge - shared disks; 3. Hadoop disk - SSD

Naming: hd - Hadoop, dev - Development, c - compute node, r - rack
Proposed Hadoop Cluster infra
• Ideally 3 racks (to allow failover across rack nodes)
• Master - 3 NN (R1, R2, R3), 3 RM (R1, R2, R3), 3 ZK (R1, R2, R3)
• Slave (worker) - the remaining nodes run as Data Nodes
• Necessary storage disks
• Minimum - 8 cores (preferred), 42 GB RAM (e.g., 10 nodes: 80 cores, 420 GB)
• Max - as needed
• Hortonworks Ambari automatically chooses which services (NN, RM, etc.) to install on
which nodes; the admin can customize this if desired.
• NN HA, RM HA, Hive HA, ZK HA
NN - Name Node, RM - Resource Manager, R - Rack, ZK - ZooKeeper, HA - High
Availability
Hadoop Cluster - ACL (Unix, Hadoop Admin)
Load Balancer (HD clients): hddev-c01-edge-01 (Hive, HDFS, YARN, Spark, Flume, Pig), hddev-c01-edge-02 (Hive, HDFS, YARN, Spark, Flume, Pig)
Web server, Kerberos, LDAP, Email, Ambari (Unix, Hadoop Admin): appdev-c01-web-01, appdev-c01-web-02

Rack-1 | Rack-2 | Rack-3
hddev-c01-r01-03 (NN) | hddev-c01-r02-08 (NN) | hddev-c01-r03-13 (NN)
hddev-c01-r01-04 (RM) | hddev-c01-r02-09 (RM) | hddev-c01-r03-14 (RM)
hddev-c01-r01-05 (Hive, NM) | hddev-c01-r02-10 (Hive, NM) | hddev-c01-r03-15 (Hive, NM)
hddev-c01-r01-06 (Storm, Hue) | hddev-c01-r02-11 (Storm) | hddev-c01-r03-16 (Storm)
hddev-c01-r01-07 (Kafka) | hddev-c01-r02-12 (Kafka) | hddev-c01-r03-17 (Kafka)

Switches connect the racks.
RDBMS (Unix, Hadoop Admin): hddev-c01-edge-03, hddev-c01-edge-04
Java is installed on all compute nodes. Naming: hd - Hadoop, dev - Development, c - compute node, r - rack
Hadoop Basic components
Service Type/Component | Master/Slave | Component Name | HDP Cluster
HDFS | Master | Name Node | Yes
HDFS | Master | Secondary NN (not required if NN HA is set up) | Yes
HDFS | Slave | Data Node | Yes
MapRed-2 | Master | History Server | Yes
YARN | Master | Resource Manager | Yes
YARN | Slave | Node Manager | Yes
Quorum Journals | Master | Journal Nodes (HA NN) | Yes
Co-ordinator | Master | ZooKeeper | Yes
HDFS/YARN | Master | Hive (RDBMS) | Edge Node
Security | Master | Ranger (RDBMS) | Yes/Outside
Security | Master | Knox | Yes/Outside
Security | Master | Kerberos | Yes/Outside
Hadoop Good to Have Components
Service Type/Component | Master/Slave | Component Name | HDP Cluster
Master Web | Master | Hue | Yes
Master Web | Master | Oozie | Yes
Spark | Master | S-Master | Yes
Spark | Worker | S-Worker | Yes
Spark | - | Spark on YARN | Yes
HDFS/standalone storage | Master | HMaster | Yes
HDFS/standalone storage | Worker | Region Server | Yes
Component | Command-line client | Sqoop | Edge Node
Component | Client (used internally by Hive) | Tez | Edge Node
Component | Command-line client | Pig | Edge Node
Hadoop Future Components
Service Type/Component | Master/Slave | Component Name | HDP Cluster
Distributed Message Service | Master | Kafka | Yes
Stream service | Master | Flume | Yes
Stream service | Master | Flink | Yes
Stream service | Master | Storm Nimbus | Yes
Stream service | Slave | Storm Supervisor | Yes
Service | Master | Solr | Yes

Cluster Control Components
Service Type/Component | Master/Slave | HDP Cluster
Ambari Server | Master (RDBMS) | No
Ambari Agent | Slave | Yes
RDBMS (MySQL/PostgreSQL etc.) | Master (replication to be set up) | No
Ambari Metrics | Master (RDBMS) | No
Install - high-level basic steps
• 1) Install the Linux OS (CentOS 6.x); enable/create user accounts: root/welcome1, hdpadmin/welcome1.
The Hadoop admin GROUPID and USERID should be at least 2000.
• 2) Ensure the SSH and Python 2.6 packages are installed and running; if not, install them using root access.
• 3) Map the hostname (FQDN) to the IP address in the hosts file.
• 4) Set the hostname, or update the hostname entry in the network file.
• 5) Tune kernel parameters; disable IPv6 and transparent huge pages.
• 6) Disable SELinux.
• 7) Ensure iptables/the firewall is off for a demo setup (later you can enable it and allow only the required ports and access).
• 8) Ensure DNS resolves names to IP addresses and vice versa.
• 9) Install the NTP service and point it at the NTP server for your region.
• 10) Set hard and soft limits for open files, and set umask.
• 11) Reboot.
• 12) The Hadoop admin account must be in the sudoers list.
• 13) Enable passwordless authentication using SSH.
(A sketch of a pre-flight check for a few of these steps follows this list.)
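A hypothetical pre-flight sketch that verifies a few of the steps above (FQDN resolution, SELinux, open-file limits). It is illustrative, not an official HDP tool, and the 10000-file threshold is an assumed baseline.

#!/usr/bin/env python
# Hypothetical pre-install sanity checks for a few of the steps above (sketch).
import resource
import socket

def report(name, ok):
    print("%-45s %s" % (name, "OK" if ok else "NEEDS ATTENTION"))

# Steps 3/8: the FQDN should resolve to an IP address.
fqdn = socket.getfqdn()
try:
    socket.gethostbyname(fqdn)
    report("FQDN %s resolves" % fqdn, True)
except socket.error:
    report("FQDN %s resolves" % fqdn, False)

# Step 6: SELinux should be disabled.
try:
    with open("/etc/selinux/config") as f:
        report("SELinux disabled", "SELINUX=disabled" in f.read())
except IOError:
    report("SELinux config readable", False)

# Step 10: soft/hard limits for open files (10000 is an assumed baseline).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
report("Open-file limit (soft=%d, hard=%d)" % (soft, hard), soft >= 10000)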

