Setting Up a Hadoop Cluster
Cluster Specification
• Hadoop is designed to run on commodity hardware.
• You are not tied to expensive, proprietary offerings; you can build your cluster from
commonly available hardware from any of a large range of vendors.
• “Commodity” does not mean “low-end.”
• Low-end machines often have cheap components, which have higher failure
rates than more expensive (but still commodity class) machines
• When you are operating tens, hundreds, or thousands of machines, cheap
components turn out to be a false economy, as the higher failure rate incurs a
greater maintenance cost.
• On the other hand, large database class machines are not recommended either,
since they don’t score well on the price/performance curve.
• Hardware specifications rapidly become obsolete, but a typical worker machine
might have the following specifications:
• Processor: 2 quad-core 2-2.5 GHz CPUs
• Memory: 16-24 GB ECC RAM
• Storage: 4 × 1 TB SATA disks
• Network: Gigabit Ethernet
Why Not Use RAID?
• HDFS clusters do not benefit from using RAID (Redundant Array of Independent Disks) for
datanode storage
• The redundancy that RAID provides is not needed, since HDFS handles it by replication between
nodes.
• RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower
than the JBOD (Just a Bunch Of Disks) configuration used by HDFS, which round-robins HDFS
blocks between all disks.
• The reason for this is that RAID 0 read and write operations are limited by the speed of the slowest
disk in the RAID array. In JBOD, disk operations are independent, so the average speed of
operations is greater than that of the slowest disk.
• JBOD performed 10% faster than RAID 0 in one test, and 30% better in another (HDFS write
throughput).
• Finally, if a disk fails in a JBOD configuration, HDFS can continue to operate without the failed
disk, whereas with RAID, failure of a single disk causes the whole array (and hence the node) to
become unavailable.
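• A rough illustration (the disk speeds are made up): if three disks in a node sustain, say, 100, 100, and
60 MB/s, a RAID 0 stripe across them is limited to roughly 3 × 60 = 180 MB/s, because every striped
operation waits for the slowest disk, whereas JBOD can drive each disk independently for a combined
throughput of up to 260 MB/s.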
Network Topology
• A common Hadoop cluster architecture consists of a two-level
network topology.
• Typically there are 30 to 40 servers per rack, with a 1 Gb (gigabit) switch
for the rack, and an uplink to a core switch or router (normally
1 Gb or better).
• The salient point is that the aggregate bandwidth between nodes
on the same rack is much greater than that between nodes on
different racks
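• As a rough example using the numbers above: 40 servers with 1 Gb/s links give up to
40 Gb/s of aggregate bandwidth within a rack, but all traffic to other racks shares the
rack's uplink (often only 1 Gb/s), an oversubscription of roughly 40:1, which is why
Hadoop tries to keep transfers within a rack whenever possible.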
Rack awareness
• To get maximum performance out of Hadoop, it is important to configure
Hadoop so that it knows the topology of your network.
• If your cluster runs on a single rack, then there is nothing more to do,
since this is the default.
• However, for multirack clusters, you need to map nodes to racks.
• By doing this, Hadoop will prefer within-rack transfers (where there is
more bandwidth available) to off-rack transfers when placing MapReduce
tasks on nodes.
• HDFS will also be able to place replicas more intelligently, trading off
performance and resilience.
• Network locations such as nodes and racks are represented in a tree, which
reflects the network “distance” between locations.
• The namenode uses the network location when determining where to place
block replicas, and the MapReduce scheduler uses network location to
determine where the closest replica is as input to a map task.
• For such a two-rack network, the rack topology is described by two network locations,
say, /switch1/rack1 and /switch1/rack2.
• Since there is only one top-level switch in this cluster, the locations can be
simplified to /rack1 and /rack2
• The Hadoop configuration must specify a map between node addresses
and network locations.
• The map is described by a Java interface, DNSToSwitchMapping,
whose signature is:
public interface DNSToSwitchMapping {
  public List<String> resolve(List<String> names);
}
• The names parameter is a list of IP addresses, and the return value is a
list of corresponding network location strings.
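• In practice the mapping is usually supplied as a small script: the default DNSToSwitchMapping
implementation (ScriptBasedMapping) runs a user-provided script named by the
topology.script.file.name property (net.topology.script.file.name in later Hadoop releases);
if no script is configured, every node is mapped to /default-rack.
• A minimal sketch of such a script, with purely illustrative IP ranges and rack names:
#!/usr/bin/env bash
# Hadoop passes one or more IP addresses or hostnames as arguments and
# expects one network location per argument on standard output.
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done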
Cluster Setup and Installation
• SSH configuration
• The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide operations.
• For example, there is a script for stopping and starting all the daemons in the
cluster.
• To work seamlessly, SSH needs to be set up to allow password-less login for the hadoop user from
machines in the cluster
• The simplest way to achieve this is to generate a public/private key pair, and place it in an NFS
location that is shared across the cluster.
• First, generate an RSA key pair by typing the following in the hadoop user account:
• % ssh-keygen -t rsa -f ~/.ssh/id_rsa
• Even though we want password-less logins, keys without passphrases
are not considered good practice, so specify a passphrase when prompted
for one.
• We shall use ssh-agent to avoid the need to enter a password for each
connection.
• The private key is in the file specified by the -f option, ~/.ssh/id_rsa,
and the public key is stored in a file with the same name with .pub
appended, ~/.ssh/id_rsa.pub.
• Next we need to make sure that the public key is in the
~/.ssh/authorized_keys file on all the machines in the cluster
that we want to connect to.
• If the hadoop user’s home directory is an NFS filesystem,
then the keys can be shared across the cluster by typing:
• % cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
• If the home directory is not shared using NFS, then the
public keys will need to be shared by some other means.
• Alternatively, for a quick test setup where an empty passphrase is acceptable:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
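• Since a passphrase is recommended, a short ssh-agent session lets you enter it only once
per login session (a minimal sketch; start-dfs.sh and start-mapred.sh are the standard
Hadoop 1.x control scripts):
$ eval "$(ssh-agent -s)"     # start an agent for this shell
$ ssh-add ~/.ssh/id_rsa      # load the key, entering the passphrase once
$ start-dfs.sh               # the SSH connections made by the control
$ start-mapred.sh            # scripts now reuse the cached key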
Hadoop Configuration
Environment Settings - Logging
• System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default.
• This can be changed using the HADOOP_LOG_DIR setting in hadoop-env.sh.
• A common choice is /var/log/hadoop, set by including the following line in hadoop-env.sh:
export HADOOP_LOG_DIR=/var/log/hadoop
• The log directory will be created if it doesn’t already exist.
• Each Hadoop daemon running on a machine produces two logfiles.
• The first is the log output written via log4j. This file, which ends in .log, should be the
first port of call when diagnosing problems, since most application log messages are
written here.
• Old .log files are not deleted automatically, so arrange for them to be deleted or archived
periodically so the local node does not run out of disk space.
• The second logfile is the combined standard output and standard error log.
• This logfile, which ends in .out, usually contains little or no output, since
Hadoop uses log4j for logging. It is only rotated when the daemon is
restarted, and only the last five logs are retained. Old logfiles are suffixed
with a number between 1 and 5, with 5 being the oldest file.
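• For example, to follow the namenode's log4j output (the user and hostname embedded in the
filename, hadoop and master01 here, are illustrative; Hadoop names each file
hadoop-<user>-<daemon>-<hostname>.log):
$ tail -f /var/log/hadoop/hadoop-hadoop-namenode-master01.log
$ ls /var/log/hadoop/*namenode*.out   # the .out file is usually small or empty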
HDFS Daemon Properties
MapReduce Daemon Properties
HTTP Server Properties
Security
• HDFS file permissions provide only a mechanism for
Authorization, which controls what a particular user
can do to a particular file
• But what about authentication, i.e. making the cluster secure?
• Authentication is the mechanism by which Hadoop assures that the user
seeking to perform an operation on the cluster is who they claim to be
and can therefore be trusted.
Security
• The file permissions system in HDFS prevents one user from accidentally
wiping out the whole filesystem because of a bug in a program, or by mistakenly
typing hadoop fs -rmr /, but it doesn’t prevent a malicious user from assuming
root’s identity to access or delete any data in the cluster.
• Hadoop uses Kerberos, a mature open-source network
authentication protocol, to authenticate the
user.
• In turn, Kerberos doesn’t manage permissions.
• Kerberos says that a user is who they say they
are; it’s Hadoop’s job to determine whether
that user has permission to perform a given
action
Kerberos 4 Overview
• a basic third-party authentication scheme
• have an Authentication Server (AS)
– users initially negotiate with AS to identify
themselves
– AS provides a non-corruptible authentication
credential (ticket granting ticket TGT)
• have a Ticket Granting server (TGS)
– users subsequently request access to other
services from the TGS on the basis of the user’s TGT
Kerberos and Hadoop
At a high level, there are three steps that a client must take to access a service when
using Kerberos, each of which involves a message exchange with a server:
1. Authentication. The client authenticates itself to the Authentication Server and
receives a timestamped Ticket-Granting Ticket (TGT).
2. Authorization. The client uses the TGT to request a service ticket from the Ticket
Granting Server.
3. Service Request. The client uses the service ticket to authenticate itself to the server
that is providing the service the client is using. In the case of Hadoop, this might
be the namenode or the jobtracker.
Together, the Authentication Server and the Ticket Granting Server form the
Key Distribution Center (KDC).
Authentication Dialogue II
• Once per user logon session
• (1) C -> AS: ID_C || ID_TGS
• (2) AS -> C: E_K(C) [ Ticket_TGS ]
• Ticket_TGS is equal to
– E_K(TGS) [ ID_C || AD_C || ID_TGS || TS1 || Lifetime1 ]
Explaining the fields
• TGS = ticket-granting server
• ID_TGS = identifier of the TGS
• Ticket_TGS = ticket-granting ticket, or TGT
• AD_C = network address of the client C
• TS1 = timestamp
• Lifetime1 = lifetime for the TGT
• K(C) = key derived from the user’s password
Messages (3) and (4)
• Once per type of service
• (3) C -> TGS: ID_C || ID_V || Ticket_TGS
• (4) TGS -> C: Ticket_V
• Ticket_V is equal to
– E_K(V) [ ID_C || AD_C || ID_V || TS2 || Lifetime2 ]
• K(V) = key shared between V and the TGS
• Ticket_V is called the service-granting ticket (SGT)
Message 5
• Once per service session
• (5) C -> V: ID_C || Ticket_V
• C says to V: “I am ID_C and have a ticket from the TGS. Let me in!”
• Seems secure, but there are problems: a captured ticket could be replayed by an
attacker, and V has no way to check that the presenter really is the client the ticket
was issued to, which is why the full Version 4 dialogue below adds authenticators
and timestamps.
Version 4 Authentication Dialogue
Authentication Service Exchange: to obtain a ticket-granting ticket
(1) C -> AS: ID_C || ID_TGS || TS1
(2) AS -> C: E_K(C) [ K_(C,TGS) || ID_TGS || TS2 || Lifetime2 || Ticket_TGS ]
Ticket-Granting Service Exchange: to obtain a service-granting ticket
(3) C -> TGS: ID_V || Ticket_TGS || Authenticator_C
(4) TGS -> C: E_K(C,TGS) [ K_(C,V) || ID_V || TS4 || Ticket_V ]
Client/Server Authentication Exchange: to obtain service
(5) C -> V: Ticket_V || Authenticator_C
(6) V -> C: E_K(C,V) [ TS5 + 1 ]
Kerberos and Hadoop
• 3 steps
• 1. Authentication: C -> AS and AS -> C [TGT with timestamp]
• 2. Authorization: C -> TGS [TGT used to request a service ticket]
• 3. Service Request: C uses the service ticket to authenticate itself to the server
providing the service (in Hadoop, this might be the namenode or the jobtracker).
• The authorization and service request steps are not user-level actions: the client
performs these steps on the user’s behalf.
• The authentication step, however, is normally carried out explicitly by the user
using the kinit command, which will prompt for a password.
• However, this doesn’t mean you need to enter your password every time you run a
job or access HDFS, since TGTs last for 10 hours by default (and can be renewed
for up to a week).
• In cases where you don’t want to be prompted for a password (for running an
unattended MapReduce job, for example), you can create a Kerberos keytab file
using the ktutil command. A keytab is a file that stores passwords and may be
supplied to kinit with the -t option
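• For example (the principal, realm, and keytab path below are purely illustrative):
$ kinit hdfs-user@EXAMPLE.COM                          # interactive: prompts for the password
$ klist                                                # show cached tickets and their expiry times
$ kinit -k -t hdfs-user.keytab hdfs-user@EXAMPLE.COM   # unattended: authenticate from a keytab instead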
• The first step is to enable Kerberos authentication by setting the
hadoop.security.authentication property in core-site.xml to kerberos.
• The default setting is simple, which means the (insecure) operating-system
user name is trusted to determine identity.
• Next, enable service-level authorization by setting
hadoop.security.authorization to true in the same file.
• You may configure Access Control Lists (ACLs) in the
hadoop-policy.xml configuration file to control which users
and groups have permission to connect to each Hadoop
service
Delegation Token
• An HDFS read operation will involve multiple calls to the namenode, as well as calls
to one or more datanodes.
• Using the three-step Kerberos ticket exchange protocol to authenticate every call
would put a high load on the KDC on a busy cluster.
• Instead, Hadoop uses delegation tokens to allow later authenticated access without
having to contact the KDC again.
• A delegation token is generated by the server (the namenode in this case),
and can be thought of as a shared secret between the client and the server.
• On the first RPC call to the namenode, the client has no delegation token,
so it uses Kerberos to authenticate, and as part of the response it gets a
delegation token from the namenode.
• In subsequent calls it presents the delegation token, which the namenode
can verify (since it generated it using a secret key), and hence the client is
authenticated to the server.
• When it wants to perform operations on HDFS blocks, the client uses a special kind of
delegation token, called a block access token, that the namenode passes to the client in
response to a metadata request.
• The client uses the block access token to authenticate itself to datanodes.
• An HDFS block may only be accessed by a client with a valid block access token from a
namenode.
• This closes the security hole in unsecured Hadoop where only the block ID was needed
to gain access to a block.
• This property is enabled by setting dfs.block.access.token.enable to true.
• Delegation tokens are used by the jobtracker and tasktrackers to access HDFS
• When the job has finished, the delegation tokens are invalidated.
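• A hedged sketch of how an unattended process can reuse a delegation token; the fetchdt
command and the HADOOP_TOKEN_FILE_LOCATION variable exist in security-enabled Hadoop
releases, but check your version's documentation, and the namenode address and file paths
here are illustrative only:
$ hadoop fetchdt --webservice http://namenode:50070 /tmp/my.token   # fetch a token while Kerberos credentials are still valid
$ export HADOOP_TOKEN_FILE_LOCATION=/tmp/my.token                   # later clients pick the token up from this variable
$ hadoop fs -ls /                                                   # authenticated via the token, no KDC contact needed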
Editor's Notes
• The core of Kerberos is the Authentication Server and the Ticket Granting Server; these are trusted by all users and servers and must be securely administered.