Setting Up Your First Hadoop Cluster
  Chad Vawter
  TriHUG Meeting: July 20, 2010
Speaker Background
  Netlib and Parallel Virtual Machine (PVM)
  High-volume messaging, complex event processing (CEP), and predictive data mining
  SOA/ESB at the U.S. Department of Homeland Security
  Banking: BPM, ETL, Reporting and Analytics
  Interests: Mahout and R/Hadoop, Functional and OO languages for the JVM (Clojure, Scala, etc.)
Goals
  High-level overview of the prerequisites to Hadoop cluster installation and operation
  High-level overview of the Hadoop configuration files
Hadoop Prerequisites
  Supported Operating Systems
    Linux
    Mac OS X
    BSD
    OpenSolaris
    Windows
      Needs Cygwin (especially OpenSSH)
      Java Service Wrapper from Tanuki Software
  Supported Java (JRE) versions
    Java 6 or later
Let’s use Linux…
Hadoop Distributions
  Apache Hadoop
  Cloudera
    Cloudera’s Distribution for Hadoop (CDH)
      Flume – streaming data collection (e.g., log files)
      Oozie – Yahoo!’s workflow engine for complex Hadoop jobs and data pipelines
      Sqoop – SQL-to-Hadoop database import and export tool
      Hadoop User Environment (Hue) – UI framework and SDK for visual Hadoop applications
    Cloudera Enterprise
      CDH + management and monitoring tools and production support services
  Yahoo! Distribution of Hadoop
    Code patches for performance and stability
    Security
    Oozie
Install the Apache Hadoop Distribution
  Create a user and group for ownership and permissions, e.g., hadoop:hadoop
  Download Hadoop from the Apache Hadoop releases page:
    http://hadoop.apache.org/common/releases.html
Hadoop Configuration
  SSH configuration
    Hadoop control scripts communicate with machines in a Hadoop cluster via SSH.
  Hadoop environment configuration
    Configure the environment in which the Hadoop daemons run.
  Configuration parameters for the Hadoop daemons
    NameNode / DataNode
    JobTracker / TaskTracker
SSH Configuration
  Hadoop control scripts use SSH for cluster-wide operations, so…
  In the hadoop user account’s home directory, generate a public/private key pair:
    ssh-keygen -t rsa -f ~/.ssh/id_rsa
  The private key will be in the ~/.ssh/id_rsa file.
  The public key will be in the ~/.ssh/id_rsa.pub file.
SSH Configuration (continued)
  The public key must be in the ~/.ssh/authorized_keys file on each machine in the Hadoop cluster:
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  Use ssh-agent to avoid having to type the passphrase of the private key when connecting from one machine in the Hadoop cluster to another.
  Run ssh-add to store the passphrase.
  We now have secure, encrypted passwordless logins.
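
The two SSH slides can be tried out safely with a sketch like the one below. It is only an illustration under stated assumptions: the keys go into a scratch directory (SSH_DIR is a hypothetical stand-in for ~/.ssh, so real keys are never touched), and the passphrase is left empty for the demo, whereas the slides assume a passphrase held by ssh-agent. Copying the public key to the other machines in the cluster is not shown.

```shell
#!/bin/sh
set -eu

# Scratch directory standing in for ~/.ssh (assumption: demo only).
SSH_DIR="$(mktemp -d)"
chmod 700 "$SSH_DIR"

# Generate an RSA key pair. -N "" sets an empty passphrase (demo only),
# -q suppresses the key fingerprint output.
ssh-keygen -q -t rsa -N "" -f "$SSH_DIR/id_rsa"

# Authorize the public key, as on the slide, and tighten permissions,
# since sshd typically rejects a group- or world-writable authorized_keys.
cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
chmod 600 "$SSH_DIR/authorized_keys"

echo "key pair and authorized_keys written to $SSH_DIR"
```

On a real cluster the same commands would target ~/.ssh, and the authorized_keys file would need to exist on every machine the control scripts connect to.
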
The Hadoop “Environment”
  Each machine in a Hadoop cluster has a configuration script for environment settings.
  Edit the hadoop-env.sh Bash script on each machine or have a mechanism for sharing environment settings; e.g., rsync.
  Values for many environment variables can be identical for all machines in the cluster.
  Not all machines will have the same hardware profile, though.
  Configure each machine’s Hadoop environment so that it best uses its resources.
hadoop-env.sh
  JAVA_HOME
  HADOOP_HOME
  HADOOP_LOG_DIR
  HADOOP_PID_DIR
  HADOOP_NAMENODE_OPTS
  HADOOP_DATANODE_OPTS
  HADOOP_SECONDARYNAMENODE_OPTS
  HADOOP_JOBTRACKER_OPTS
  HADOOP_TASKTRACKER_OPTS
  HADOOP_HEAPSIZE
  HADOOP_SLAVES
  HADOOP_SSH_OPTS
  HADOOP_MASTER and HADOOP_SLAVE_SLEEP
  …
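
As an illustration of what a few of these settings might look like, the sketch below writes a small hadoop-env.sh fragment into a scratch directory and sources it. Every value is a hypothetical placeholder (the JDK path, the directories, the heap size), and CONF_DIR is an assumed stand-in for the real conf/ directory; the variable names are the ones listed on the slide.

```shell
#!/bin/sh
set -eu

# Scratch directory standing in for Hadoop's conf/ directory (assumption).
CONF_DIR="$(mktemp -d)"

cat > "$CONF_DIR/hadoop-env.sh" <<'EOF'
# All values below are placeholders; adjust per machine.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk   # hypothetical JDK location
export HADOOP_HEAPSIZE=2000                    # daemon heap size, assumed to be in MB
export HADOOP_LOG_DIR=/var/log/hadoop          # hypothetical log directory
export HADOOP_PID_DIR=/var/run/hadoop          # hypothetical PID directory
EOF

# Source the fragment the way a control script would, then show one value.
. "$CONF_DIR/hadoop-env.sh"
echo "HADOOP_HEAPSIZE=$HADOOP_HEAPSIZE"
```

A machine with more memory could carry a larger HADOOP_HEAPSIZE in its own copy of the file, which is the per-machine tuning the previous slide describes.
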
Read-Only Default Configuration Files
  src/core/core-default.xml
  src/hdfs/hdfs-default.xml
  src/mapred/mapred-default.xml
Site-Specific Configuration Files
  Override the values provided in the default configuration files:
    conf/core-site.xml
    conf/hdfs-site.xml
    conf/mapred-site.xml
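
A site file overrides a default by restating the property with a new value. The sketch below generates three minimal site files in a scratch directory. The property names (fs.default.name, dfs.replication, mapred.job.tracker) are the ones I recall from Hadoop releases of this era and should be checked against the matching *-default.xml files; the hostnames, ports, and replication factor are purely illustrative, and write_site_file is a hypothetical helper, not part of Hadoop.

```shell
#!/bin/sh
set -eu

# Scratch directory standing in for Hadoop's conf/ directory (assumption).
CONF_DIR="$(mktemp -d)"

# write_site_file FILE NAME VALUE
# Hypothetical helper: writes a site file containing a single property.
write_site_file() {
  cat > "$CONF_DIR/$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Illustrative values only: hostnames and ports are made up.
write_site_file core-site.xml   fs.default.name    hdfs://namenode.example.com:8020
write_site_file hdfs-site.xml   dfs.replication    3
write_site_file mapred-site.xml mapred.job.tracker jobtracker.example.com:8021

cat "$CONF_DIR/core-site.xml"
```
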
Other Configuration Files
  slaves
    This file defines which machines will run DataNodes and/or TaskTrackers.
    Note: We don’t need to specify which machine(s) will run a NameNode and/or a JobTracker. The Hadoop control scripts start the NameNode or JobTracker on the machine on which they are run.
  hadoop-metrics.properties
  log4j.properties
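
The slaves file is a plain list of hostnames, one per line. The sketch below writes a sample file to a scratch directory and walks it the way a control script might. The hostnames are hypothetical, and the handling of comments and blank lines is an assumption about how the control scripts read the file rather than something stated on the slide.

```shell
#!/bin/sh
set -eu

# Scratch directory standing in for Hadoop's conf/ directory (assumption).
CONF_DIR="$(mktemp -d)"

cat > "$CONF_DIR/slaves" <<'EOF'
# worker machines (hypothetical hostnames)
worker01.example.com
worker02.example.com

worker03.example.com
EOF

# Strip comments, then drop blank lines, leaving one host per line.
HOSTS="$(sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$CONF_DIR/slaves")"

# Only print what would be contacted; nothing is started here.
for host in $HOSTS; do
  echo "would start DataNode/TaskTracker on $host"
done
```
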
Hadoop Startup
  Format a new distributed file system:
    bin/hadoop namenode -format
  Start the HDFS on the designated NameNode:
    bin/start-dfs.sh
  The start-dfs.sh script consults the conf/slaves file on the NameNode and starts a DataNode daemon on each of the listed slaves.
  Start MapReduce on the designated JobTracker:
    bin/start-mapred.sh
  The start-mapred.sh script consults the conf/slaves file on the JobTracker and starts a TaskTracker daemon on each of the listed slaves.
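
The startup steps can be collected into a small wrapper. The sketch below is a dry run by default: it only prints the three commands from the slide. HADOOP_HOME=/opt/hadoop, the DRY_RUN switch, and the run and start_cluster helpers are all hypothetical names introduced for this illustration. It also assumes a single machine acts as both NameNode and JobTracker, which the slides do not require.

```shell
#!/bin/sh
set -eu

HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"   # hypothetical install location
DRY_RUN="${DRY_RUN:-1}"                     # set to 0 to actually execute

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

start_cluster() {
  # Formatting applies to a NEW file system only; it discards existing HDFS metadata.
  run "$HADOOP_HOME/bin/hadoop" namenode -format
  run "$HADOOP_HOME/bin/start-dfs.sh"
  run "$HADOOP_HOME/bin/start-mapred.sh"
}

PLAN="$(start_cluster)"
printf '%s\n' "$PLAN"
```

On later startups the format step would be left out, since the file system already exists.
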
Hadoop Shutdown
  Stop the HDFS on the designated NameNode:
    bin/stop-dfs.sh
  The stop-dfs.sh script consults the conf/slaves file on the NameNode and stops the DataNode daemon on each of the listed slaves.
  Stop MapReduce on the designated JobTracker:
    bin/stop-mapred.sh
  The stop-mapred.sh script consults the conf/slaves file on the JobTracker and stops the TaskTracker daemon on each of the listed slaves.
Other Hadoop Installation Options
  Cloud Computing with Hadoop
    Amazon EC2
      Xen open-source virtual machine monitor (hypervisor)
    Amazon Elastic MapReduce
    VMware vCloud
    Windows Azure?
    …
TriHUG Meeting Suggestions?
  Hadoop Performance-Tuning with Advanced Configuration
  Data Warehousing and Large-Scale Extraction, Transformation and Loading (ETL) with Hadoop
  High-Volume Reporting with Hadoop
  Hadoop and Object-Functional Languages for the JVM
  Others?
Resources - Hadoop
  Apache Hadoop
    http://hadoop.apache.org/
  Hadoop: The Definitive Guide
    http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979/ref=sr_1_1?ie=UTF8&s=books&qid=1279640275&sr=8-1
  Downloading and Installing Hadoop
    http://wiki.apache.org/hadoop/GettingStartedWithHadoop
  Cloudera’s Hadoop Distribution
    http://www.cloudera.com/
  Yahoo’s Hadoop Distribution
    http://developer.yahoo.com/hadoop/
Resources - Hadoop (continued)
  Supported Java Versions
    http://wiki.apache.org/hadoop/HadoopJavaVersions
  Hadoop on Windows with Eclipse
    http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html
Resources - Amazon EC2
  Amazon Elastic Compute Cloud (EC2)
    http://aws.amazon.com/ec2/
  Amazon Elastic MapReduce
    http://aws.amazon.com/elasticmapreduce/
  EC2 Starter’s Guide for Ubuntu
    https://help.ubuntu.com/community/EC2StartersGuide
Resources - Miscellaneous
  Xen Open-Source Virtual Machine Monitor
    http://www.xen.org/
  Virtualization - Comparison
    http://www.virtualbox.org/wiki/VBox_vs_Others
Keep in Touch [email_address]

July 2010 Triangle Hadoop Users Group - Chad Vawter Slides


Editor's Notes

  • #3: Netlib: CS? HPC at ORNL; PVM/MPI as precursors to Hadoop; Not for exactly the same purposes (explain) Messaging/CEP/Analytics: Algorithmic trading - Usually predictive data mining for definition of complex events prior to CEP runtime JTV/DHS: CEP for anti-money laundering (AML), etc. BI: 180,000 reports daily; large-scale ETL (Pentaho and/or Talend) Interests: Links to Scala/Hadoop, R/Hadoop