Mapreduce Programming at Comet and HW2 Log Analysis
UCSB CS240A 2016. Tao Yang
Data Analysis from Web Server Logs
Startup code and data: /home/tyang/cs240sample/log
apache1.splunk.com
apache2.splunk.com
apache3.splunk.com
Example line of the log file
10.32.1.43 - - [06/Feb/2013:00:07:00] "GET /flower_store/product.screen?product_id=FL-DLH-02 HTTP/1.1" 200 10901 "http://mystore.splunk.com/flower_store/category.screen?category_id=GIFTS&JSESSIONID=SD7SL1FF9ADFF2" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070223 CentOS/1.5.0.10-0.1.el4.centos Firefox/1.5.0.10" 4361 3217
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"
Log Format
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000]
"GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"
More Formal Definition of Apache Log
%h %l %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i"
%h = IP address of the client (remote host) which made the request
%l = RFC 1413 identity of the client
%u = userid of the person requesting the document
%t = Time that the server finished processing the request
%r = Request line from the client in double quotes
%s = Status code that the server sends back to the client
%b = Size of the object returned to the client
Referer: where the request originated
User-agent: what type of agent made the request.
http://www.the-art-of-web.com/system/logs/
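As a standalone illustration (not part of the startup code), the fields defined above can be pulled out of one log line with a Java regular expression. This is only a sketch; the class name and group layout are assumptions for demonstration.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parse one common-log-format line into its fields.
public class ApacheLogLine {
    // Groups: 1=%h host, 2=%l ident, 3=%u user, 4=%t time, 5=%r request, 6=%s status, 7=%b bytes
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    public static void main(String[] args) {
        String line = "66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "
            + "\"GET / HTTP/1.0\" 200 6433 \"-\" \"Googlebot/2.1\"";
        Matcher m = LINE.matcher(line);
        if (m.find()) {
            System.out.println("host=" + m.group(1) + " status=" + m.group(6)
                + " bytes=" + m.group(7) + " request=" + m.group(5));
        }
    }
}

Running this prints host=66.249.64.13 status=200 bytes=6433 request=GET / HTTP/1.0 for the sample line above.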
Common Response Codes
• 200 - OK
• 206 - Partial Content
• 301 - Moved Permanently
• 302 - Found
• 304 - Not Modified
• 401 - Unauthorized (password required)
• 403 - Forbidden
• 404 - Not Found
LogAnalyzer.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogAnalyzer {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    if (args.length != 2) {
      System.err.println("Usage: loganalyzer <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "analyze log");
    job.setJarByClass(LogAnalyzer.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
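A hedged example of how the driver might be launched once the three classes are packaged into a jar; the jar name loganalyzer.jar and the HDFS paths logs/ and output/ are assumptions, not part of the startup code:

# Run the log analyzer over an HDFS input directory; output/ must not already exist
hadoop jar loganalyzer.jar LogAnalyzer logs/ output/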
Map.java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text url = new Text();
  // Capture the request path that follows GET or POST
  private Pattern p = Pattern.compile("(?:GET|POST)\\s([^\\s]+)");

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] entries = value.toString().split("\\r?\\n");
    for (int i = 0, len = entries.length; i < len; i += 1) {
      Matcher matcher = p.matcher(entries[i]);
      if (matcher.find()) {
        url.set(matcher.group(1));
        context.write(url, one);
      }
    }
  }
}
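The mapper above keys on the request URL because its regular expression captures the token after GET or POST. For HW2-style questions about response codes, a hedged variation (not in the startup code) only needs a different pattern and key; the driver would then use job.setMapperClass(StatusCodeMap.class) and everything else stays the same.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical variant of Map.java: count HTTP response codes instead of URLs.
public class StatusCodeMap extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text code = new Text();
  // Capture the 3-digit status code that follows the quoted request line
  private Pattern p = Pattern.compile("\" (\\d{3}) ");

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    Matcher matcher = p.matcher(value.toString());
    if (matcher.find()) {
      code.set(matcher.group(1));   // e.g. "200", "404"
      context.write(code, one);
    }
  }
}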
Reduce.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable total = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    total.set(sum);
    context.write(key, total);
  }
}
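Because the reducer just sums counts, it is also safe to reuse as a combiner, which pre-aggregates on each mapper and cuts the data shuffled to the reducers. This is an optional tweak to the driver, not something the provided code does:

// Optional addition to LogAnalyzer.main: reuse the reducer as a combiner
job.setCombinerClass(Reduce.class);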
Comet Cluster
• Comet cluster has 1944 nodes and each node has
24 cores, built on two 12-core Intel Xeon E5-2680v3
2.5 GHz processors
• 128 GB memory and 320GB SSD for local scratch
space.
• Attached storage: Shared 7 petabytes of 200
GB/second performance storage and 6 petabytes of
100 GB/second durable storage
 Lustre Storage Area is a Parallel File System (PFS) called Data Oasis.
– Users can access it from /oasis/scratch/comet/$USER/temp_project
Hadoop installation at Comet
• Installed in /opt/hadoop/1.2.1
o Configure Hadoop on-demand with myHadoop:
 /opt/hadoop/contrib/myHadoop/bin/myhadoop-configure.sh
The Hadoop file system is built dynamically on the allocated nodes and is deleted when the allocation is terminated.
Compile the sample Java code at Comet
Java word count example is available at Comet under
/home/tyang/cs240sample/mapreduce/.
• cp -r /home/tyang/cs240sample/mapreduce .
• Allocate a dedicated machine for compiling
 /share/apps/compute/interactive/qsubi.bash -p compute --nodes=1 --ntasks-per-node=1 -t 00:
• Change the working directory to mapreduce and type make
 Java code is compiled under the target subdirectory (see the sketch below)
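A rough sketch of what the make step amounts to, assuming the Hadoop 1.2.1 core jar sits at the top of /opt/hadoop/1.2.1 and the source file is WordCount.java; the actual Makefile in the sample directory may differ:

# Compile against the Hadoop 1.2.1 API and package the classes into a jar
mkdir -p target
javac -classpath /opt/hadoop/1.2.1/hadoop-core-1.2.1.jar -d target WordCount.java
jar cf wordcount.jar -C target .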
How to Run a WordCount Mapreduce Job
 Use the "compute" partition for allocation
 Use the Java word count example at Comet under /home/tyang/cs240sample/mapreduce/.
 sbatch submit-hadoop-comet.sh
– Data input is in test.txt
– Data output is in WC-out
 Job trace is wordcount.1569018.comet-17-14.out
[Slide diagram: jobs are submitted from the login node (comet.sdsc.xsede.org) to the Comet cluster's "compute" queue]
Sample script (submit-hadoop-comet.sh)
#!/bin/bash
#SBATCH --job-name="wordcount"
#SBATCH --output="wordcount.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH -t 00:15:00
export HADOOP_CONF_DIR=/home/$USER/cometcluster
export WORKDIR=`pwd`
module load hadoop/1.2.1
#Use myHadoop to build a Hadoop file system on the allocated nodes
myhadoop-configure.sh
#Start all daemons
start-all.sh
Sample script (continued)
#make an input directory in the hadoop file system
hadoop dfs -mkdir input
#copy data from local Linux file system to the Hadoop file system
hadoop dfs -copyFromLocal $WORKDIR/test.txt input/
#Run Hadoop wordcount job
hadoop jar $WORKDIR/wordcount.jar wordcount input/ output/
# Create a local directory WC-out to hold the output data
# rm does not report an error even if the directory does not exist
rm -rf WC-out >/dev/null || true
mkdir -p WC-out
# Copy out the output data
hadoop dfs -copyToLocal output/part* WC-out
#Stop all daemons and clean up
stop-all.sh
myhadoop-cleanup.sh
Sample output trace
wordcount.1569018.comet-17-14.out
starting namenode, logging to /scratch/tyang/1569018/logs/hadoop-tyang-namenode-comet-17-14.out
comet-17-14.ibnet: starting datanode, logging to /scratch/tyang/1569018/logs/hadoop-tyang-datanode-comet-17-14.sdsc.edu.out
comet-17-15.ibnet: starting datanode, logging to /scratch/tyang/1569018/logs/hadoop-tyang-datanode-comet-17-15.sdsc.edu.out
comet-17-14.ibnet: starting secondarynamenode, logging to /scratch/tyang/1569018/logs/hadoop-tyang-secondarynamenode-comet-17-14.sdsc.edu.out
starting jobtracker, logging to /scratch/tyang/1569018/logs/hadoop-tyang-jobtracker-comet-17-14.out
comet-17-14.ibnet: starting tasktracker, logging to /scratch/tyang/1569018/logs/hadoop-tyang-tasktracker-comet-17-14.sdsc.edu.out
comet-17-15.ibnet: starting tasktracker, logging to /scratch/tyang/1569018/logs/hadoop-tyang-tasktracker-comet-17-15.sdsc.edu.out
Sample output trace
wordcount.1569018.comet-17-14.out
16/01/31 17:43:44 INFO input.FileInputFormat: Total input paths to process : 1
16/01/31 17:43:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/01/31 17:43:44 WARN snappy.LoadSnappy: Snappy native library not loaded
16/01/31 17:43:44 INFO mapred.JobClient: Running job: job_201601311743_0001
16/01/31 17:43:45 INFO mapred.JobClient: map 0% reduce 0%
16/01/31 17:43:49 INFO mapred.JobClient: map 100% reduce 0%
16/01/31 17:43:56 INFO mapred.JobClient: map 100% reduce 33%
16/01/31 17:43:57 INFO mapred.JobClient: map 100% reduce 100%
16/01/31 17:43:57 INFO mapred.JobClient: Job complete: job_201601311743_0001
comet-17-14.ibnet: stopping tasktracker
comet-17-15.ibnet: stopping tasktracker
stopping namenode
comet-17-14.ibnet: stopping datanode
comet-17-15.ibnet: stopping datanode
comet-17-14.ibnet: stopping secondarynamenode
Copying Hadoop logs back to /home/tyang/cometcluster/logs...
`/scratch/tyang/1569018/logs' -> `/home/tyang/cometcluster/logs'
Sample input and output
$ cat test.txt
how are you today 3 4 mapreduce program
1 2 3 test send
how are you mapreduce
1 send test USA california new
$ cat WC-out/part-r-00000
1 2
2 1
3 2
4 1
USA 1
are 2
california 1
how 2
mapreduce 2
new 1
program 1
send 2
test 2
today 1
you 2
Shell Commands for Hadoop File System
• mkdir, ls, cat, cp
 hadoop dfs -mkdir /user/deepak/dir1
 hadoop dfs -ls /user/deepak
 hadoop dfs -cat /user/deepak/file.txt
 hadoop dfs -cp /user/deepak/dir1/abc.txt /user/deepak/dir2
• Copy data from the local file system to HDFS
 hadoop dfs -copyFromLocal <src:localFileSystem> <dest:HDFS>
 Ex: hadoop dfs -copyFromLocal /home/hduser/def.txt /user/deepak/dir1
• Copy data from HDFS to local
 hadoop dfs -copyToLocal <src:HDFS> <dest:localFileSystem>
http://www.bigdataplanet.info/2013/10/All-Hadoop-Shell-Commands-you-need-Hadoop-Tutorial-Part-5.html
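For example, the word count output produced by the earlier script can be inspected directly in HDFS before (or instead of) copying it out; the path output/part-r-00000 matches the earlier run but is otherwise an assumption:

# List and print the job output stored in HDFS
hadoop dfs -ls output
hadoop dfs -cat output/part-r-00000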
Notes
• The Java process listing command "jps" shows the following daemons:
NameNode (master), SecondaryNameNode, DataNode (hadoop), JobTracker, TaskTracker
• To check the status of your job
squeue -u username
• To cancel a submitted job
scancel job-id
• You have to request *all* 24 cores on the nodes. Hadoop is Java-based and any memory limits start causing problems. Also, in the compute partition you are charged for the whole node anyway.
Notes
• Your script should delete the local output directory if you want to rerun and copy data out to that directory again. Otherwise the Hadoop copy back fails because the file already exists. The current script removes "WC-out" for this reason.
• If you are running several Mapreduce jobs simultaneously, please make sure you choose different locations for the configuration files. Basically, change the line
export HADOOP_CONF_DIR=/home/$USER/cometcluster
to point to a different directory for each run (see the sketch below). Otherwise the configurations from different jobs will overwrite each other in the same directory and cause problems.
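One hedged way to do this inside the batch script is to fold the SLURM job id into the path (assuming the standard $SLURM_JOBID variable is set, as it normally is for sbatch jobs):

# Give each job its own Hadoop configuration directory
export HADOOP_CONF_DIR=/home/$USER/cometcluster/$SLURM_JOBID
mkdir -p $HADOOP_CONF_DIR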