July 2010 Triangle Hadoop Users Group - Chad Vawter Slides

Setting Up Your First Hadoop Cluster
Chad Vawter
TriHUG Meeting: July 20, 2010

Speaker Background
- Netlib and Parallel Virtual Machine (PVM)
- High-volume messaging, complex event processing (CEP), and predictive data mining
- SOA/ESB at the U.S. Department of Homeland Security
- Banking: BPM, ETL, Reporting and Analytics
- Interests: Mahout and R/Hadoop, Functional and OO languages for the JVM (Clojure, Scala, etc.)

Goals
- High-level overview of the prerequisites to Hadoop cluster installation and operation
- High-level overview of the Hadoop configuration files

Hadoop Prerequisites
- Supported Operating Systems
  - Linux
  - Mac OS X
  - BSD
  - OpenSolaris
  - Windows
    - Needs Cygwin (especially OpenSSH)
    - Java Service Wrapper from Tanuki Software
- Supported Java (JRE) versions
  - Java 6 or later

Hadoop Distributions
- Apache Hadoop
- Cloudera
  - Cloudera’s Distribution for Hadoop (CDH)
    - Flume: streaming data collection (e.g., log files)
    - Oozie: Yahoo!’s workflow engine for complex Hadoop jobs and data pipelines
    - Sqoop: SQL-to-Hadoop database import and export tool
    - Hadoop User Environment (Hue): UI framework and SDK for visual Hadoop applications
  - Cloudera Enterprise
    - CDH + management and monitoring tools and production support services
- Yahoo! Distribution of Hadoop
  - Code patches for performance and stability
  - Security
  - Oozie

Install the Apache Hadoop Distribution
- Create a user and group for ownership and permissions, e.g., hadoop:hadoop
- Download Hadoop from the Apache Hadoop releases page: http://hadoop.apache.org/common/releases.html
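The two installation steps might be scripted roughly as follows. This is only a sketch: the function name install_plan, the version 0.20.2, and the /opt prefix are assumptions for illustration, and the tarball is assumed to have already been downloaded from the releases page. The function prints the commands for review rather than running them, since they require root.

```shell
#!/bin/sh
# install_plan VERSION PREFIX
# Prints (does not run) the commands for creating the hadoop user and group
# and unpacking a previously downloaded release tarball.
# VERSION and PREFIX are illustrative assumptions, not values from the slides.
install_plan() {
    version="$1"
    prefix="$2"
    cat <<EOF
groupadd hadoop
useradd -g hadoop -m hadoop
tar -xzf hadoop-${version}.tar.gz -C ${prefix}
chown -R hadoop:hadoop ${prefix}/hadoop-${version}
EOF
}

install_plan 0.20.2 /opt
```

After checking the printed commands, they could be piped to a root shell, for example with install_plan 0.20.2 /opt | sudo sh.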

Hadoop Configuration
- SSH configuration: Hadoop control scripts communicate with machines in a Hadoop cluster via SSH.
- Hadoop environment configuration: configure the environment in which the Hadoop daemons run.
- Configuration parameters for the Hadoop daemons
  - NameNode / DataNode
  - JobTracker / TaskTracker

SSH Configuration
Hadoop control scripts use SSH for cluster-wide operations, so in the hadoop user account’s home directory, generate a public/private key pair:
    ssh-keygen -t rsa -f ~/.ssh/id_rsa
- The private key will be in the ~/.ssh/id_rsa file.
- The public key will be in the ~/.ssh/id_rsa.pub file.

SSH Configuration (continued)
The public key must be in the ~/.ssh/authorized_keys file on each machine in the Hadoop cluster:
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Use ssh-agent to avoid having to type the passphrase of the private key when connecting from one machine in the Hadoop cluster to another.
- Run ssh-add to store the passphrase.
We now have secure, encrypted passwordless logins.
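Running the cat command more than once appends the same key repeatedly. A minimal sketch of an idempotent alternative is shown here; the helper name append_key is hypothetical, and the permission settings (700 on the directory, 600 on the file) are a common OpenSSH convention rather than something the slides specify.

```shell
#!/bin/sh
# append_key PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Adds the public key to authorized_keys only if it is not already listed.
# (append_key is a hypothetical helper, not part of Hadoop or OpenSSH.)
append_key() {
    pub="$1"
    auth="$2"
    # Make sure the directory and file exist with restrictive permissions.
    mkdir -p "$(dirname "$auth")"
    chmod 700 "$(dirname "$auth")"
    touch "$auth"
    chmod 600 "$auth"
    # -F fixed string, -x whole-line match, -f read the pattern from the key file.
    if ! grep -qxF -f "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
}

# Typical call on each machine in the cluster:
#   append_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

Because the function checks before appending, it can be rerun safely whenever a machine is added or rebuilt.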

The Hadoop “Environment”
- Each machine in a Hadoop cluster has a configuration script for environment settings.
- Edit the hadoop-env.sh Bash script on each machine, or have a mechanism for sharing environment settings, e.g., rsync.
- Values for many environment variables can be identical for all machines in the cluster.
- Not all machines will have the same hardware profile, though. Configure each machine’s Hadoop environment so that it best uses its resources.

hadoop-env.sh
- JAVA_HOME
- HADOOP_HOME
- HADOOP_LOG_DIR
- HADOOP_PID_DIR
- HADOOP_NAMENODE_OPTS
- HADOOP_DATANODE_OPTS
- HADOOP_SECONDARYNAMENODE_OPTS
- HADOOP_JOBTRACKER_OPTS
- HADOOP_TASKTRACKER_OPTS
- HADOOP_HEAPSIZE
- HADOOP_SLAVES
- HADOOP_SSH_OPTS
- HADOOP_MASTER and HADOOP_SLAVE_SLEEP
- …
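As an illustration, a hadoop-env.sh fragment setting several of these variables might look like the following. Every path and value here is an assumption chosen for the example (the JDK location and install directory in particular will differ between machines); only the variable names come from the slide.

```shell
# Example hadoop-env.sh settings. All values are illustrative assumptions.
export JAVA_HOME=/usr/lib/jvm/java-6-sun          # assumed JDK location
export HADOOP_HOME=/opt/hadoop                    # assumed install directory
export HADOOP_LOG_DIR="${HADOOP_HOME}/logs"       # where daemon logs are written
export HADOOP_PID_DIR=/var/hadoop/pids            # where daemon PID files are written
export HADOOP_HEAPSIZE=1000                       # daemon heap size in MB
export HADOOP_SLAVES="${HADOOP_HOME}/conf/slaves" # file listing the worker machines
export HADOOP_SSH_OPTS="-o ConnectTimeout=1"      # extra options passed to ssh
# Per-daemon JVM options; this one enables JMX for the NameNode.
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote ${HADOOP_NAMENODE_OPTS:-}"
```

On machines with more memory, HADOOP_HEAPSIZE is one of the values most likely to differ from node to node.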

Read-Only Default Configuration Files
- src/core/core-default.xml
- src/hdfs/hdfs-default.xml
- src/mapred/mapred-default.xml

Site-Specific Configuration Files
Override the values provided in the default configuration files:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml
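A minimal set of site files could be generated as sketched below. The property names fs.default.name, dfs.replication, and mapred.job.tracker are the ones used by Hadoop releases of this era; the host names, ports, and the CONF_DIR variable are assumptions for illustration. The script writes to a scratch directory by default so that it cannot overwrite a real configuration.

```shell
#!/bin/sh
# Writes minimal site-specific overrides. Host names and ports are hypothetical.
CONF_DIR="${CONF_DIR:-${TMPDIR:-/tmp}/hadoop-conf-example}"
mkdir -p "$CONF_DIR"

# core-site.xml: where clients find the NameNode.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: block replication factor (3 is also the built-in default).
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
EOF

# mapred-site.xml: where TaskTrackers and clients find the JobTracker.
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
EOF
```

Any property not listed in a site file keeps the value from the corresponding read-only default file.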

Other Configuration Files
- slaves
  - This file defines which machines will run DataNodes and/or TaskTrackers.
  - Note: We don’t need to specify which machine(s) will run a NameNode and/or a JobTracker. The Hadoop control scripts start the NameNode or JobTracker on whichever machine they are run on.
- hadoop-metrics.properties
- log4j.properties
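A slaves file is simply a list of host names, one per line. The sketch below writes an example file and counts its entries; the host names, the SLAVES_FILE variable, and the count_slaves helper are all hypothetical, and the comment line assumes that the control scripts skip lines starting with #.

```shell
#!/bin/sh
# Writes an example slaves file to a scratch location (hypothetical hosts).
SLAVES_FILE="${SLAVES_FILE:-${TMPDIR:-/tmp}/slaves-example}"
cat > "$SLAVES_FILE" <<'EOF'
# worker machines, one per line
worker1.example.com
worker2.example.com
worker3.example.com
EOF

# count_slaves FILE: number of lines that are neither blank nor comments.
count_slaves() {
    grep -Ecv '^[[:space:]]*(#|$)' "$1"
}

count_slaves "$SLAVES_FILE"   # prints 3
```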

Hadoop Startup
- Format a new distributed file system:
    bin/hadoop namenode -format
- Start HDFS on the designated NameNode:
    bin/start-dfs.sh
  The start-dfs.sh script consults the conf/slaves file on the NameNode and starts a DataNode daemon on each of the listed slaves.
- Start MapReduce on the designated JobTracker:
    bin/start-mapred.sh
  The start-mapred.sh script consults the conf/slaves file on the JobTracker and starts a TaskTracker daemon on each of the listed slaves.
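The start-up order can be rehearsed without a cluster using a small wrapper like the one below. The wrapper itself, including the DRY_RUN and FORMAT_NEW_FS switches and the /opt/hadoop default, is an assumption made for this sketch; only the three bin/ commands come from the slide. Formatting is kept behind an explicit switch because formatting a NameNode that is already in use discards the existing file system metadata.

```shell
#!/bin/sh
# Sketch of the start-up sequence. With DRY_RUN=1 (the default here) the
# commands are printed instead of executed.
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"   # assumed install directory
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

start_cluster() {
    # Only format when creating a brand-new file system.
    if [ "${FORMAT_NEW_FS:-0}" = "1" ]; then
        run "$HADOOP_HOME/bin/hadoop" namenode -format
    fi
    run "$HADOOP_HOME/bin/start-dfs.sh"      # run on the NameNode machine
    run "$HADOOP_HOME/bin/start-mapred.sh"   # run on the JobTracker machine
}

start_cluster
```

When the NameNode and JobTracker live on different machines, the two start scripts would be run on their respective hosts rather than from one wrapper.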

Hadoop Shutdown
- Stop HDFS on the designated NameNode:
    bin/stop-dfs.sh
  The stop-dfs.sh script consults the conf/slaves file on the NameNode and stops the DataNode daemon on each of the listed slaves.
- Stop MapReduce on the designated JobTracker:
    bin/stop-mapred.sh
  The stop-mapred.sh script consults the conf/slaves file on the JobTracker and stops the TaskTracker daemon on each of the listed slaves.
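The way a control script walks the slaves file can be pictured with the simplified sketch below. It is not the actual Hadoop script: the fan_out function is hypothetical, and it prints the ssh command it would issue for each host instead of connecting.

```shell
#!/bin/sh
# fan_out SLAVES_FILE COMMAND...
# For each host listed in SLAVES_FILE, print the ssh invocation that would
# run COMMAND there. Comments and blank lines in the file are skipped.
# (Simplified illustration only; the real control scripts do more.)
fan_out() {
    slaves_file="$1"
    shift
    sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$slaves_file" |
    while read -r host; do
        echo "ssh $host $*"
    done
}

# Example, assuming a slaves file at conf/slaves:
#   fan_out conf/slaves bin/hadoop-daemon.sh stop datanode
```

Replacing the echo with a real ssh call, plus any options from HADOOP_SSH_OPTS, would turn the preview into an actual cluster-wide command.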

Other Hadoop Installation Options
Cloud Computing with Hadoop
- Amazon EC2
- Xen open-source virtual machine monitor (hypervisor)
- Amazon Elastic MapReduce
- VMware vCloud
- Windows Azure?
- …

TriHUG Meeting Suggestions?
- Hadoop Performance-Tuning with Advanced Configuration
- Data Warehousing and Large-Scale Extraction, Transformation and Loading (ETL) with Hadoop
- High-Volume Reporting with Hadoop
- Hadoop and Object-Functional Languages for the JVM
- Others?

Resources - Hadoop
- Apache Hadoop: http://hadoop.apache.org/
- Hadoop: The Definitive Guide: http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979/ref=sr_1_1?ie=UTF8&s=books&qid=1279640275&sr=8-1
- Downloading and Installing Hadoop: http://wiki.apache.org/hadoop/GettingStartedWithHadoop
- Cloudera’s Hadoop Distribution: http://www.cloudera.com/
- Yahoo’s Hadoop Distribution: http://developer.yahoo.com/hadoop/

Resources - Hadoop (continued)
- Supported Java Versions: http://wiki.apache.org/hadoop/HadoopJavaVersions
- Hadoop on Windows with Eclipse: http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html

Resources - Amazon EC2
- Amazon Elastic Compute Cloud (EC2): http://aws.amazon.com/ec2/
- Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
- EC2 Starter’s Guide for Ubuntu: https://help.ubuntu.com/community/EC2StartersGuide

Resources - Miscellaneous
- Xen Open-Source Virtual Machine Monitor: http://www.xen.org/
- Virtualization - Comparison: http://www.virtualbox.org/wiki/VBox_vs_Others
