Setting Up a Hadoop Cluster
Cluster Specification
• Hadoop is designed to run on commodity hardware.
• You are not tied to expensive, proprietary offerings; you can build your cluster from
commonly available hardware from any of a large range of vendors.
• “Commodity” does not mean “low-end.”
• Low-end machines often have cheap components, which have higher failure
rates than more expensive (but still commodity class) machines
• When you are operating tens, hundreds, or thousands of machines, cheap
components turn out to be a false economy, as the higher failure rate incurs a
greater maintenance cost.
• On the other hand, large database class machines are not recommended either,
since they don’t score well on the price/performance curve.
• Hardware specifications rapidly become obsolete, but a typical worker machine
might have the following specifications:
• Processor: 2 quad-core 2-2.5 GHz CPUs
• Memory: 16-24 GB ECC RAM
• Storage: 4 × 1 TB SATA disks
• Network: Gigabit Ethernet
Why Not Use RAID?
• HDFS clusters do not benefit from using RAID (Redundant Array of Independent Disks) for
datanode storage
• The redundancy that RAID provides is not needed, since HDFS handles it by replication between
nodes.
• RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower
than the JBOD (Just a Bunch Of Disks) configuration used by HDFS, which round-robins HDFS
blocks between all disks.
• The reason for this is that RAID 0 read and write operations are limited by the speed of the slowest
disk in the RAID array. In JBOD, disk operations are independent, so the average speed of
operations is greater than that of the slowest disk.
• JBOD performed 10% faster than RAID 0 in one test, and 30% better in another (HDFS write
throughput).
• Finally, if a disk fails in a JBOD configuration, HDFS can continue to operate without the failed
disk, whereas with RAID, failure of a single disk causes the whole array (and hence the node) to
become unavailable.
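• A rough illustration (the disk speeds are made up): if three disks in a node sustain, say, 100, 100, and
60 MB/s, a RAID 0 stripe across them is limited to roughly 3 × 60 = 180 MB/s, because every striped
operation waits for the slowest disk, whereas JBOD can drive each disk independently for a combined
throughput of up to 260 MB/s.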
Network Topology
• A common Hadoop cluster architecture consists of a two-level
network topology.
• Typically there are 30 to 40 servers per rack, with a 1 Gb (gigabit) switch
for the rack, and an uplink to a core switch or router (normally
1 Gb or better).
• The salient point is that the aggregate bandwidth between nodes
on the same rack is much greater than that between nodes on
different racks
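• As a rough example using the numbers above: 40 servers with 1 Gb/s links give up to
40 Gb/s of aggregate bandwidth within a rack, but all traffic to other racks shares the
rack's uplink (often only 1 Gb/s), an oversubscription of roughly 40:1, which is why
Hadoop tries to keep transfers within a rack whenever possible.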
Rack awareness
• To get maximum performance out of Hadoop, it is important to configure
Hadoop so that it knows the topology of your network.
• If your cluster runs on a single rack, then there is nothing more to do,
since this is the default.
• However, for multirack clusters, you need to map nodes to racks.
• By doing this, Hadoop will prefer within-rack transfers (where there is
more bandwidth available) to off-rack transfers when placing MapReduce
tasks on nodes.
• HDFS will also be able to place replicas more intelligently, trading off
performance and resilience.
• Network locations such as nodes and racks are represented in a tree, which
reflects the network “distance” between locations.
• The namenode uses the network location when determining where to place
block replicas, and the MapReduce scheduler uses network location to
determine where the closest replica is as input to a map task.
• For such a two-rack network, the rack topology is described by two network locations,
say, /switch1/rack1 and /switch1/rack2.
• Since there is only one top-level switch in this cluster, the locations can be
simplified to /rack1 and /rack2
• The Hadoop configuration must specify a map between node addresses
and network locations.
• The map is described by a Java interface, DNSToSwitchMapping,
whose signature is:
public interface DNSToSwitchMapping {
  public List<String> resolve(List<String> names);
}
• The names parameter is a list of IP addresses, and the return value is a
list of corresponding network location strings.
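• In practice the mapping is usually supplied as a small script: the default DNSToSwitchMapping
implementation (ScriptBasedMapping) runs a user-provided script named by the
topology.script.file.name property (net.topology.script.file.name in later Hadoop releases);
if no script is configured, every node is mapped to /default-rack.
• A minimal sketch of such a script, with purely illustrative IP ranges and rack names:
#!/usr/bin/env bash
# Hadoop passes one or more IP addresses or hostnames as arguments and
# expects one network location per argument on standard output.
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done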
Cluster Setup and Installation
• SSH configuration
• The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide operations.
• For example, there is a script for stopping and starting all the daemons in the
cluster.
• To work seamlessly, SSH needs to be set up to allow password-less login for the hadoop user from
machines in the cluster
• The simplest way to achieve this is to generate a public/private key pair, and place it in an NFS
location that is shared across the cluster.
• First, generate an RSA key pair by typing the following in the hadoop user account:
• % ssh-keygen -t rsa -f ~/.ssh/id_rsa
• Even though we want password-less logins, keys without passphrases
are not considered good practice, so specify a passphrase when prompted
for one.
• We shall use ssh-agent to avoid the need to enter a password for each
connection.
• The private key is in the file specified by the -f option, ~/.ssh/id_rsa,
and the public key is stored in a file with the same name with .pub
appended, ~/.ssh/id_rsa.pub.
• Next we need to make sure that the public key is in the
~/.ssh/authorized_keys file on all the machines in the cluster
that we want to connect to.
• If the hadoop user’s home directory is an NFS filesystem,
then the keys can be shared across the cluster by typing:
• % cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
• If the home directory is not shared using NFS, then the
public keys will need to be shared by some other means.
• Alternatively, for a quick test setup where an empty passphrase is acceptable:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
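• Since a passphrase is recommended, a short ssh-agent session lets you enter it only once
per login session (a minimal sketch; start-dfs.sh and start-mapred.sh are the standard
Hadoop 1.x control scripts):
$ eval "$(ssh-agent -s)"     # start an agent for this shell
$ ssh-add ~/.ssh/id_rsa      # load the key, entering the passphrase once
$ start-dfs.sh               # the SSH connections made by the control
$ start-mapred.sh            # scripts now reuse the cached key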
Hadoop Configuration
Environment Settings - Logging
• System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default.
• This can be changed using the HADOOP_LOG_DIR setting in hadoop-env.sh.
• A common choice is /var/log/hadoop, set by including the following line in hadoop-env.sh:
export HADOOP_LOG_DIR=/var/log/hadoop
• The log directory will be created if it doesn’t already exist.
• Each Hadoop daemon running on a machine produces two logfiles.
• The first is the log output written via log4j. This file, which ends in .log, should be the
first port of call when diagnosing problems, since most application log messages are
written here.
• Old .log files are not deleted automatically, so arrange for them to be deleted or archived
periodically so the local node does not run out of disk space.
• The second logfile is the combined standard output and standard error log.
• This logfile, which ends in .out, usually contains little or no output, since
Hadoop uses log4j for logging. It is only rotated when the daemon is
restarted, and only the last five logs are retained. Old logfiles are suffixed
with a number between 1 and 5, with 5 being the oldest file.
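• For example, to follow the namenode's log4j output (the user and hostname embedded in the
filename, hadoop and master01 here, are illustrative; Hadoop names each file
hadoop-<user>-<daemon>-<hostname>.log):
$ tail -f /var/log/hadoop/hadoop-hadoop-namenode-master01.log
$ ls /var/log/hadoop/*namenode*.out   # the .out file is usually small or empty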
HDFS Daemon Properties
MapReduce Daemon Properties
HTTP Server Properties
Security
• HDFS file permissions provide only a mechanism for
Authorization, which controls what a particular user
can do to a particular file
• But what about authentication, i.e. making the cluster secure?
• Authentication is the mechanism by which Hadoop assures that the user
seeking to perform an operation on the cluster is who they claim to be
and can therefore be trusted.
Security
• The file permissions system in HDFS prevents one user from accidentally
wiping out the whole filesystem because of a bug in a program, or by mistakenly
typing hadoop fs -rmr /, but it doesn’t prevent a malicious user from assuming
root’s identity to access or delete any data in the cluster.
• Hadoop uses Kerberos, a mature open-source network
authentication protocol, to authenticate the
user.
• In turn, Kerberos doesn’t manage permissions.
• Kerberos says that a user is who they say they
are; it’s Hadoop’s job to determine whether
that user has permission to perform a given
action
Kerberos 4 Overview
• a basic third-party authentication scheme
• have an Authentication Server (AS)
– users initially negotiate with AS to identify
themselves
– AS provides a non-corruptible authentication
credential (ticket granting ticket TGT)
• have a Ticket Granting server (TGS)
– users subsequently request access to other
services from the TGS on the basis of the user’s TGT
Kerberos and Hadoop
At a high level, there are three steps that a client must take to access a service when
using Kerberos, each of which involves a message exchange with a server:
1. Authentication. The client authenticates itself to the Authentication Server and
receives a timestamped Ticket-Granting Ticket (TGT).
2. Authorization. The client uses the TGT to request a service ticket from the Ticket
Granting Server.
3. Service Request. The client uses the service ticket to authenticate itself to the server
that is providing the service the client is using. In the case of Hadoop, this might
be the namenode or the jobtracker.
Together, the Authentication Server and the Ticket Granting Server form the
Key Distribution Center (KDC).
Authentication Dialogue II
• Once per user logon session
• (1) C -> AS: ID_C || ID_TGS
• (2) AS -> C: E_K(C) [ Ticket_TGS ]
• Ticket_TGS is equal to
– E_K(TGS) [ ID_C || AD_C || ID_TGS || TS1 || Lifetime1 ]
Explaining the fields
• TGS = ticket-granting server
• ID_TGS = identifier of the TGS
• Ticket_TGS = ticket-granting ticket, or TGT
• AD_C = network address of the client C
• TS1 = timestamp
• Lifetime1 = lifetime for the TGT
• K(C) = key derived from the user’s password
Messages (3) and (4)
• Once per type of service
• (3) C -> TGS: ID_C || ID_V || Ticket_TGS
• (4) TGS -> C: Ticket_V
• Ticket_V is equal to
– E_K(V) [ ID_C || AD_C || ID_V || TS2 || Lifetime2 ]
• K(V) = key shared between V and the TGS
• Ticket_V is called the service-granting ticket (SGT)
Message 5
• Once per service session
• (5) C -> V: ID_C || Ticket_V
• C says to V: “I am ID_C and have a ticket from the TGS. Let me in!”
• Seems secure, but there are problems: a captured ticket could be replayed by an
attacker, and V has no way to check that the presenter really is the client the ticket
was issued to, which is why the full Version 4 dialogue below adds authenticators
and timestamps.
Version 4 Authentication Dialogue
Authentication Service Exchange: to obtain a ticket-granting ticket
(1) C -> AS: ID_C || ID_TGS || TS1
(2) AS -> C: E_K(C) [ K_(C,TGS) || ID_TGS || TS2 || Lifetime2 || Ticket_TGS ]
Ticket-Granting Service Exchange: to obtain a service-granting ticket
(3) C -> TGS: ID_V || Ticket_TGS || Authenticator_C
(4) TGS -> C: E_K(C,TGS) [ K_(C,V) || ID_V || TS4 || Ticket_V ]
Client/Server Authentication Exchange: to obtain service
(5) C -> V: Ticket_V || Authenticator_C
(6) V -> C: E_K(C,V) [ TS5 + 1 ]
Kerberos and Hadoop
• 3 steps
• 1. Authentication: C -> AS and AS -> C [TGT with timestamp]
• 2. Authorization: C -> TGS [TGT used to request a service ticket]
• 3. Service Request: C uses the service ticket to authenticate itself to the server
providing the service (in Hadoop, this might be the namenode or the jobtracker).
• The authorization and service request steps are not user-level actions: the client
performs these steps on the user’s behalf.
• The authentication step, however, is normally carried out explicitly by the user
using the kinit command, which will prompt for a password.
• However, this doesn’t mean you need to enter your password every time you run a
job or access HDFS, since TGTs last for 10 hours by default (and can be renewed
for up to a week).
• In cases where you don’t want to be prompted for a password (for running an
unattended MapReduce job, for example), you can create a Kerberos keytab file
using the ktutil command. A keytab is a file that stores passwords and may be
supplied to kinit with the -t option
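• For example (the principal, realm, and keytab path below are purely illustrative):
$ kinit hdfs-user@EXAMPLE.COM                          # interactive: prompts for the password
$ klist                                                # show cached tickets and their expiry times
$ kinit -k -t hdfs-user.keytab hdfs-user@EXAMPLE.COM   # unattended: authenticate from a keytab instead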
• The first step is to enable Kerberos authentication by setting the
hadoop.security.authentication property in core-site.xml to kerberos.
• The default setting is simple, which means the (insecure) operating-system
user name is trusted to determine identity.
• Next, enable service-level authorization by setting
hadoop.security.authorization to true in the same file.
• You may configure Access Control Lists (ACLs) in the
hadoop-policy.xml configuration file to control which users
and groups have permission to connect to each Hadoop
service
Delegation Token
• An HDFS read operation will involve multiple calls to the namenode, as well as calls
to one or more datanodes.
• Using the three-step Kerberos ticket exchange protocol to authenticate every call
would put a high load on the KDC on a busy cluster.
• Instead, Hadoop uses delegation tokens to allow later authenticated access without
having to contact the KDC again.
• A delegation token is generated by the server (the namenode in this case),
and can be thought of as a shared secret between the client and the server.
• On the first RPC call to the namenode, the client has no delegation token,
so it uses Kerberos to authenticate, and as part of the response it gets a
delegation token from the namenode.
• In subsequent calls it presents the delegation token, which the namenode
can verify (since it generated it using a secret key), and hence the client is
authenticated to the server.
• When it wants to perform operations on HDFS blocks, the client uses a special kind of
delegation token, called a block access token, that the namenode passes to the client in
response to a metadata request.
• The client uses the block access token to authenticate itself to datanodes.
• An HDFS block may only be accessed by a client with a valid block access token from a
namenode.
• This closes the security hole in unsecured Hadoop where only the block ID was needed
to gain access to a block.
• This property is enabled by setting dfs.block.access.token.enable to true.
• Delegation tokens are used by the jobtracker and tasktrackers to access HDFS
• When the job has finished, the delegation tokens are invalidated.
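• A hedged sketch of how an unattended process can reuse a delegation token; the fetchdt
command and the HADOOP_TOKEN_FILE_LOCATION variable exist in security-enabled Hadoop
releases, but check your version's documentation, and the namenode address and file paths
here are illustrative only:
$ hadoop fetchdt --webservice http://namenode:50070 /tmp/my.token   # fetch a token while Kerberos credentials are still valid
$ export HADOOP_TOKEN_FILE_LOCATION=/tmp/my.token                   # later clients pick the token up from this variable
$ hadoop fs -ls /                                                   # authenticated via the token, no KDC contact needed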
Editor's Notes
• The core of Kerberos is the Authentication Server and the Ticket Granting Server; these are trusted by all users and servers and must be securely administered.