CSE6801 Distributed Computing Systems (November 2013)
Finding specific url access pattern using map reduce framework on
a cluster of machines
Nushrat Humaira
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology
1 Introduction
The aim of this project is to find specific URL access patterns in a real web log data set using the
MapReduce library with the help of Apache Hadoop. The web access logs from the official website
of the FIFA 98 World Cup are used as the data set for our testbed, which is set up on a cluster of 6
machines. Apache Hadoop is used to process this large data set. The number of hits per day, and
hence the days with the most hits, is counted with the MapReduce library on Apache Hadoop. The
client ID, which stands in for the IP address, is used as the key for the mapper class. The objective
of the project is to investigate which clients accessed the site most frequently and what types of
content they viewed. The report is organized as follows. In Section 2 we introduce the problem,
followed by Section 3 where we survey a few earlier works. Section 4 explains our experimental
data set in detail, and the system and MapReduce job setup instructions with prerequisites are
described in Section 5. Section 6 presents the experimental results, and Section 7 concludes the report.
2 Problem Statement
Today's technology depends on data. Data is growing so fast, and the volume is too large to process
in a single process. Large data sets require a large number of processing steps over many different
inputs and produce a large number of results. Coordinating and configuring this processing needs
special attention and resources.
MapReduce is a programming model and framework for processing large data sets. The user supplies
a map function that processes a key/value pair and generates intermediate key/value pairs, and a
reduce function that merges all intermediate pairs sharing the same key. This functional style lets
large data be processed by distributing the computation through parallelization. Interesting information
can be retrieved with the MapReduce framework; for example, we can find out how many times a
website was hit by a particular IP address. Such results could be used to identify enthusiasts of a
particular field.
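To make this model concrete, a minimal Hadoop mapper and reducer for the hit-counting example
above could be written roughly as follows. This is only an illustrative sketch using the new Hadoop
API: the class names are our own, and it assumes the client identifier is the first whitespace-separated
field of each log line.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (clientIdentifier, 1) for every request line in the log.
class HitCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text client = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumption: the client identifier is the first field of the line.
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            client.set(fields[0]);
            context.write(client, ONE);
        }
    }
}

// Reducer: sums the intermediate counts for each key, giving total hits per client.
class HitCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text client, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) {
            sum += c.get();
        }
        context.write(client, new LongWritable(sum));
    }
}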
Hadoop is an open source framework for processing large data sets in a highly efficient manner;
it enables distributed processing of large data sets across clusters of commodity servers. Hadoop
enables a computing solution that is scalable, cost effective, flexible and fault tolerant. Apache
Hadoop has two main subprojects: MapReduce and HDFS. HDFS is a filesystem inspired by the
Google File System (GFS) that can store very large data sets by scaling out across a cluster of
host machines. It is optimized for high throughput rather than low latency and provides high
availability through replication. HDFS stores files in blocks, typically 64 MB in size. Each storage
node runs a process called a DataNode that manages the blocks on that host, and these are
coordinated by a master NameNode process running on a separate host. In MapReduce, input data
is processed by the map function and the resulting intermediate data is processed by the reduce
function. Hadoop lets users implement these functions as mappers and reducers and handles the
rest of the work, such as execution, parallelization and coordination, itself.
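In practice, block size and replication are set through the HDFS configuration file hdfs-site.xml;
a minimal sketch (the values shown are the common defaults, not necessarily the ones we used)
looks like this:

<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>    <!-- 64 MB block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- each block stored on three DataNodes -->
  </property>
</configuration>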
A cluster of machines, or parallel processes on a single machine, can address the problem of big
data taking too long to process, and the processed results can then be reused whenever needed.
We implemented map and reduce functions and used the Hadoop framework to process the web
logs from the official website of the FIFA 98 World Cup. We tried to find out which clients hit the
website most often and on which specific days the site received the most accesses. The heaviest
hitters point to the most enthusiastic visitors, and the days with the most hits point to the events
people were most interested in. We could also find out what types of content, e.g. images or text,
were requested most frequently over the days.
3 Related Work
MapReduce is used for processing large data sets. Users write a map function that takes data as
a key/value pair and generates a set of intermediate key/value pairs, and a reduce function that
merges all intermediate values associated with the same intermediate key into a single output. The
details of partitioning the input data, scheduling the program's execution across a set of machines,
handling machine failures, and managing the required inter-machine communication are taken care
of by the run-time system [2].
Log file analysis is becoming a necessary task for analyzing customer behavior in order to improve
advertising and sales. Log files are generated very fast, and analyzing such large data sets requires
a parallel processing system and a reliable data storage mechanism. Virtual database systems are
an effective solution but become inefficient for large data sets. The Hadoop framework provides
reliable data storage with its distributed file system and the MapReduce programming model:
Hadoop breaks up the input data and sends fractions of the original data to several machines. The
HMR log analyzer, introduced with this approach, provides accurate results with minimum response
time [4].
As stored data grows voluminously, methods for retrieving relevant information and addressing
security-related concerns must be handled efficiently. Emerging big data concepts raise security
issues, and secure data transfer using data mining concepts in a cloud environment with Hadoop
MapReduce is a challenging task. The Hadoop-based approach gives excellent results on these
security issues; experiments were performed and the results analyzed with respect to time and
space complexity, comparing the Hadoop approach with a non-Hadoop approach [5].
A parallel implementation of a genetic algorithm using the MapReduce programming paradigm is
another important piece of work, using the Hadoop implementation of the MapReduce library. The
implementation was compared with previous works on the OneMax (bit counting) problem. This
model for parallelizing genetic algorithms shows better performance and fitness convergence, but
yields lower solution quality because of the species problem [3].
The work on building a distributed search system with Apache Hadoop and Lucene analyses the
problems arising from the so-called Big Data scenario. It explores a technological and algorithmic
approach for handling and computing over amounts of data that exceed the limits of a traditional
architecture based on real-time request processing, through an analysis of a single open source
technology, Apache Hadoop, which implements the MapReduce approach [1].
These related works give an overview relevant to this project. We follow some of their techniques
and implement a small part of them, using the MapReduce framework with the help of Hadoop to
process large data sets.
4 Description of Web data to be processed with Apache Hadoop
The data set, collected from the official FIFA '98 World Cup website, consists of 1,352,804,107
requests in total, made between April 30, 1998 and July 26, 1998. Although access logs are commonly
preserved in text log format, to reduce size and analysis time they were converted to the following
binary format:
struct request {
    uint32_t timestamp;
    uint32_t clientID;
    uint32_t objectID;
    uint32_t size;
    uint8_t  method;
    uint8_t  status;
    uint8_t  type;
    uint8_t  server;
};
The fields are described as follows:
timestamp - the time of the request, stored as the number of seconds since the Epoch and converted
to GMT
clientID - a unique integer identifier for the client issuing the request; each clientID maps to
exactly one IP address, which may be a proxy, due to privacy concerns
objectID - a unique integer identifier for the requested URL
size - the number of bytes in the response
method - the method contained in the client's request (e.g., GET)
status - the two highest-order bits contain the HTTP version indicated in the client's request (e.g.,
HTTP/1.0); the remaining 6 bits indicate the response status code (e.g., 200 OK)
type - the type of file requested (e.g., HTML, IMAGE, etc.)
server - indicates which server handled the request
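Each record is therefore a fixed 20 bytes. Purely as an illustration of the layout (the project itself
works on the recreated text logs, as explained next), a record could be decoded in Java along these
lines, assuming the fields are stored in network (big-endian) byte order:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Illustrative reader for the fixed-size 20-byte binary trace records described above.
public class BinaryTraceReader {
    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            while (in.available() >= 20) {
                long timestamp = in.readInt() & 0xFFFFFFFFL; // uint32_t seconds since the Epoch
                long clientId  = in.readInt() & 0xFFFFFFFFL; // uint32_t client identifier
                long objectId  = in.readInt() & 0xFFFFFFFFL; // uint32_t URL identifier
                long size      = in.readInt() & 0xFFFFFFFFL; // uint32_t response size in bytes
                int method = in.readUnsignedByte();          // uint8_t method code (e.g., GET)
                int status = in.readUnsignedByte();          // uint8_t HTTP version + status code
                int type   = in.readUnsignedByte();          // uint8_t file type code
                int server = in.readUnsignedByte();          // uint8_t server identifier
                // ... process one request record here ...
            }
        } finally {
            in.close();
        }
    }
}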
As the access logs are archived in this binary format, to suit our needs we used the recreate tool
available in the tool archive provided by the Internet Traffic Archive (ITA), which converts the
binary logs back to their original text log format.
5 System setup and Prerequisites
Hadoop requires a working installation of JDK 1.6 to run. We are using Apache Hadoop
version 1.1.2 for our project. All necessary paths for the Java and Hadoop home directories need
to be pre-configured. Hadoop also requires SSH (Secure Shell) so that processes on the different
hosts can communicate without needing a password. We use the following command
$ ssh-keygen
to initiate this process and store a new public key in the list of authorized keys. We also configure
the necessary environment variables for Hadoop, the directory where data files are stored, and the
network ports it listens on by editing the respective script and XML files on the master and slave
machines. We also format the Hadoop file system via the namenode.
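For illustration, the master is usually pointed to from conf/core-site.xml and conf/mapred-site.xml
on every node; the host name and port numbers below are examples only, not our actual cluster
addresses:

<!-- conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>   <!-- NameNode address -->
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>          <!-- JobTracker address -->
</property>

The file system is then formatted once from the master with
$ bin/hadoop namenode -format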
Then we start our multi-node cluster by executing the command
$ start-all.sh
This will start the HDFS daemons: the NameNode is started on the master and DataNode daemons
are started on all slaves. It will also start the MapReduce daemons: the JobTracker is started on
the master and TaskTrackers are started on all slaves. Executing the following command will stop
all daemons.
$ stop-all.sh
Before we run any MapReduce job on our cluster we must first copy the files from our local file
system into Hadoop's HDFS. If this is the first time a MapReduce job is executed on this namenode,
the namenode must be formatted first, as noted above. We stored our log files in a directory named
fifa98trace in HDFS; for the sake of clarity, here is the command to be executed:
$ bin/hadoop dfs -copyFromLocal /home/hadoop/fifa98trace /user/hadoop/fifa98trace
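Once the input is in HDFS, each job is configured by a small driver class and submitted with the
hadoop jar command. The sketch below shows one possible driver for the hit-counting mapper and
reducer sketched in Section 2; the jar name, class names and output path are illustrative, not the
exact ones we used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the mapper and reducer together and points the job at HDFS paths.
public class HitCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "fifa98 hit count");
        job.setJarByClass(HitCountDriver.class);
        job.setMapperClass(HitCountMapper.class);
        job.setReducerClass(HitCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/hadoop/fifa98trace
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/hadoop/output-mr1
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

$ bin/hadoop jar urlpattern.jar HitCountDriver /user/hadoop/fifa98trace /user/hadoop/output-mr1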
6 Experiment Results
Before we started our map and reduce jobs, we prepared our test bed and test data set.
• We chose a cluster of 6 machines, each configured with an Intel Core 2 Duo @ 2.40 GHz x 2,
approximately 80 GB of permanent storage, and Ubuntu 12.04 LTS (32-bit) installed.
• We wrote 4 different MapReduce jobs, namely MR-1, MR-2, MR-3 and MR-4. We randomly chose
data from 15 days, totalling about 12 GB in size.
• The mapper in MR-1 outputs each unique client ID (not IP address) and its occurrences. The
reducer groups and merges them to output the top 5 client IDs with the greatest number of hits;
a sketch of one way such a top-N selection could be coded is given after Table 4.
The result of MR-1, the 5 most frequent visitors, is shown in Table 1. The total time required
to complete the job was approximately 20 min 2 sec.
ClientID    Hit count
298294 228457
514370 151637
3533 148412
9054 147910
129 137905
Table 1: Result of MapReduce Job -1
• MR-2 outputs the peak hour, or more precisely the interval during which the website got the
most hits over the whole tournament duration. The total time to output the result was approximately
22 min 21 sec. The result is shown in Table 2.
On 30 April, 1998, the interval 21:00:00 - 22:00:00 got the most hits: 324
Table 2: Result of MapReduce Job -2
• MR-3 shows the type of content or file requested from the website over the tournament, i.e.
whether image files or text/html files were requested most. The total time required to complete
the job was approximately 18 min 23 sec. The result is shown in Table 3.
File type requested    No. of times
image 113695131
text/html 17106837
Table 3: Result of MapReduce Job -3
• MR-4 outputs a list of every day with its corresponding total hit count, from which we can deduce
that days with the most hits indicate a special event occurring that day. The total time taken was
approximately 22 min 8 sec. The result is shown in Table 4.
Date Hit Count
30 Apr,1998 98759
01 May,1998 1158019
02 May,1998 692603
05 May,1998 103376
06 May,1998 1584440
08 May,1998 109996
09 May,1998 905842
17 May,1998 112111
18 May,1998 1434082
25 May,1998 244558
26 May,1998 3894595
02 Jun,1998 517042
03 Jun,1998 7934114
04 Jun,1998 8292042
05 Jun,1998 8048127
06 Jun,1998 5337589
07 Jun,1998 5848249
08 Jun,1998 15138598
09 Jun,1998 20527120
10 Jun,1998 48820687
Table 4: Result of MapReduce Job -4
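The grouping and top-5 selection of MR-1, referred to above, can be coded in several ways. One
simple possibility, paired with a hit-counting mapper like the one sketched in Section 2 and run with
a single reducer, is to keep a bounded sorted map of running totals and emit it when the reducer
finishes. The sketch below is illustrative only and is not the exact project code.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer for a top-N query: totals the hits per client ID and keeps only the N largest.
// Ties on the total overwrite each other in this simple sketch; real code would handle them.
public class TopClientsReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private static final int N = 5;
    // Maps running total -> client ID; the TreeMap keeps entries ordered by total.
    private final TreeMap<Long, String> top = new TreeMap<Long, String>();

    @Override
    protected void reduce(Text clientId, Iterable<LongWritable> counts, Context context) {
        long total = 0;
        for (LongWritable c : counts) {
            total += c.get();
        }
        top.put(total, clientId.toString());
        if (top.size() > N) {
            top.remove(top.firstKey()); // drop the current smallest total
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the surviving N entries, largest total first.
        for (Map.Entry<Long, String> e : top.descendingMap().entrySet()) {
            context.write(new Text(e.getValue()), new LongWritable(e.getKey()));
        }
    }
}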
7 Conclusion and Future Work
The MapReduce framework is used here to analyze log files, which constitute big data. The
MapReduce functions are implemented on top of Hadoop. We wrote specific map and reduce jobs
to answer each of our queries: distinct client counts, the time of the most frequent hits, and the
types of content in the most frequent hits. The map and reduce jobs we implemented are capable
of running both on a single machine and on a cluster of machines. Future work could extend this
by making the reduce stage as parallel as the map stage.
References
[1] Mirko Calvaresi. Building a distributed search system with Apache Hadoop and Lucene. In Proc.
of ICCV, 2009.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters.
In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[3] Dino Keco and Abdulhamit Subasi. Parallelization of genetic algorithms using Hadoop
Map/Reduce. Southeast Europe Journal of Soft Computing, 2012.
[4] Sayalee Narkhede and Tripti Baraskar. HMR log analyzer: Analyze web application logs over
Hadoop MapReduce. International Journal of UbiComp, July 2013.
[5] P. Srinivasa Rao, K. Thammi Reddy, and M. H. M. Krishna Prasad. A novel approach for
identification of Hadoop cloud temporal patterns using MapReduce. I.J. Information Technology
and Computer Science, 2014.
