CSE6801 Distributed Computing Systems (November 2013)
Finding specific url access pattern using map reduce framework on
a cluster of machines
Nushrat Humaira
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology
1 Introduction
The aim of this project is to find specific URL access patterns in a real web log data set using the
MapReduce library with the help of Apache Hadoop. The web access logs from the official website
of the FIFA 98 World Cup are used as the data set for our testbed, which is set up on a cluster of 6
machines. Apache Hadoop is used to process this large data set. The number of hits per day, and
hence the days with the most hits, is counted with the MapReduce library on Apache Hadoop. The
client ID, which stands in for the IP address, is used as the key for the mapper class. The objective
of the project is to investigate which clients accessed the site most frequently and what types of
content they viewed. The report is organized as follows. In Section 2 we introduce the problem,
followed by Section 3 where we survey a few earlier works. Section 4 explains our experimental
data set in detail, and the system and MapReduce job setup instructions with prerequisites are
described in Section 5. Section 6 presents the experimental results, and Section 7 concludes the report.
2 Problem Statement
Today's technology depends on data. Data is growing so fast, and the volume is too large to process
in a single process. Large data sets require a large number of processing steps over many different
inputs and produce a large number of results. Coordinating and configuring this processing needs
special attention and resources.
MapReduce is a programming model and framework for processing large data sets. The user supplies
a map function that processes a key/value pair and generates intermediate key/value pairs, and a
reduce function that merges all intermediate pairs sharing the same key. This functional style lets
large data be processed by distributing the computation through parallelization. Interesting information
can be retrieved with the MapReduce framework; for example, we can find out how many times a
website was hit by a particular IP address. Such results could be used to identify enthusiasts of a
particular field.
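To make this model concrete, a minimal Hadoop mapper and reducer for the hit-counting example
above could be written roughly as follows. This is only an illustrative sketch using the new Hadoop
API: the class names are our own, and it assumes the client identifier is the first whitespace-separated
field of each log line.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (clientIdentifier, 1) for every request line in the log.
class HitCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text client = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumption: the client identifier is the first field of the line.
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            client.set(fields[0]);
            context.write(client, ONE);
        }
    }
}

// Reducer: sums the intermediate counts for each key, giving total hits per client.
class HitCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text client, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) {
            sum += c.get();
        }
        context.write(client, new LongWritable(sum));
    }
}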
Hadoop is an open source framework for processing large data sets in a highly efficient manner;
it enables distributed processing of large data sets across clusters of commodity servers. Hadoop
enables a computing solution that is scalable, cost effective, flexible and fault tolerant. Apache
Hadoop has two main subprojects: MapReduce and HDFS. HDFS is a filesystem inspired by the
Google File System (GFS) that can store very large data sets by scaling out across a cluster of
host machines. It is optimized for high throughput rather than low latency and provides high
availability through replication. HDFS stores files in blocks, typically 64 MB in size. Each storage
node runs a process called a DataNode that manages the blocks on that host, and these are
coordinated by a master NameNode process running on a separate host. In MapReduce, input data
is processed by the map function and the resulting intermediate data is processed by the reduce
function. Hadoop lets users implement these functions as mappers and reducers and handles the
rest of the work, such as execution, parallelization and coordination, itself.
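In practice, block size and replication are set through the HDFS configuration file hdfs-site.xml;
a minimal sketch (the values shown are the common defaults, not necessarily the ones we used)
looks like this:

<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>    <!-- 64 MB block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- each block stored on three DataNodes -->
  </property>
</configuration>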
A cluster of machines, or parallel processes on a single machine, can address the problem of big
data taking too long to process, and the processed results can then be reused whenever needed.
We implemented map and reduce functions and used the Hadoop framework to process the web
logs from the official website of the FIFA 98 World Cup. We tried to find out which clients hit the
website most often and on which specific days the site received the most accesses. The heaviest
hitters point to the most enthusiastic visitors, and the days with the most hits point to the events
people were most interested in. We could also find out what types of content, e.g. images or text,
were requested most frequently over the days.
3 Related Work
MapReduce is used for processing large data sets. Users write a map function that takes data as
a key/value pair and generates a set of intermediate key/value pairs, and a reduce function that
merges all intermediate values associated with the same intermediate key into a single output. The
details of partitioning the input data, scheduling the program's execution across a set of machines,
handling machine failures, and managing the required inter-machine communication are taken care
of by the run-time system [2].
Log file analysis is becoming a necessary task for analyzing customer behavior in order to improve
advertising and sales. Log files are generated very fast, and analyzing such large data sets requires
a parallel processing system and a reliable data storage mechanism. Virtual database systems are
an effective solution but become inefficient for large data sets. The Hadoop framework provides
reliable data storage with its distributed file system and the MapReduce programming model:
Hadoop breaks up the input data and sends fractions of the original data to several machines. The
HMR log analyzer, introduced with this approach, provides accurate results with minimum response
time [4].
As stored data grows voluminously, methods for retrieving relevant information and addressing
security-related concerns must be handled efficiently. Emerging big data concepts raise security
issues, and secure data transfer using data mining concepts in a cloud environment with Hadoop
MapReduce is a challenging task. The Hadoop-based approach gives excellent results on these
security issues; experiments were performed and the results analyzed with respect to time and
space complexity, comparing the Hadoop approach with a non-Hadoop approach [5].
A parallel implementation of a genetic algorithm using the MapReduce programming paradigm is
another important piece of work, using the Hadoop implementation of the MapReduce library. The
implementation was compared with previous works on the OneMax (bit counting) problem. This
model for parallelizing genetic algorithms shows better performance and fitness convergence, but
yields lower solution quality because of the species problem [3].
The work on building a distributed search system with Apache Hadoop and Lucene analyses the
problems arising from the so-called Big Data scenario. It explores a technological and algorithmic
approach for handling and computing over amounts of data that exceed the limits of a traditional
architecture based on real-time request processing, through an analysis of a single open source
technology, Apache Hadoop, which implements the MapReduce approach [1].
These related works give an overview relevant to this project. We follow some of their techniques
and implement a small part of them, using the MapReduce framework with the help of Hadoop to
process large data sets.
4 Description of Web data to be processed with Apache Hadoop
The data set, collected from the official FIFA '98 World Cup website, consists of 1,352,804,107
requests in total, made between April 30, 1998 and July 26, 1998. Although access logs are commonly
preserved in text log format, to reduce size and analysis time they were converted to the following
binary format:
struct request {
    uint32_t timestamp;
    uint32_t clientID;
    uint32_t objectID;
    uint32_t size;
    uint8_t  method;
    uint8_t  status;
    uint8_t  type;
    uint8_t  server;
};
The fields are described as follows:
timestamp - the time of the request, stored as the number of seconds since the Epoch and converted
to GMT
clientID - a unique integer identifier for the client issuing the request; each clientID maps to
exactly one IP address, which may be a proxy, due to privacy concerns
objectID - a unique integer identifier for the requested URL
size - the number of bytes in the response
method - the method contained in the client's request (e.g., GET)
status - the two highest-order bits contain the HTTP version indicated in the client's request (e.g.,
HTTP/1.0); the remaining 6 bits indicate the response status code (e.g., 200 OK)
type - the type of file requested (e.g., HTML, IMAGE, etc.)
server - indicates which server handled the request
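Each record is therefore a fixed 20 bytes. Purely as an illustration of the layout (the project itself
works on the recreated text logs, as explained next), a record could be decoded in Java along these
lines, assuming the fields are stored in network (big-endian) byte order:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Illustrative reader for the fixed-size 20-byte binary trace records described above.
public class BinaryTraceReader {
    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            while (in.available() >= 20) {
                long timestamp = in.readInt() & 0xFFFFFFFFL; // uint32_t seconds since the Epoch
                long clientId  = in.readInt() & 0xFFFFFFFFL; // uint32_t client identifier
                long objectId  = in.readInt() & 0xFFFFFFFFL; // uint32_t URL identifier
                long size      = in.readInt() & 0xFFFFFFFFL; // uint32_t response size in bytes
                int method = in.readUnsignedByte();          // uint8_t method code (e.g., GET)
                int status = in.readUnsignedByte();          // uint8_t HTTP version + status code
                int type   = in.readUnsignedByte();          // uint8_t file type code
                int server = in.readUnsignedByte();          // uint8_t server identifier
                // ... process one request record here ...
            }
        } finally {
            in.close();
        }
    }
}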
As the access logs are archived in this binary format, to suit our needs we used the recreate tool
available in the tool archive provided by the Internet Traffic Archive (ITA), which converts the
binary logs back to their original text log format.
5 System setup and Prerequisites
Hadoop requires a working installation of JDK 1.6 to run. We are using Apache Hadoop
version 1.1.2 for our project. All necessary paths for the Java and Hadoop home directories need
to be pre-configured. Hadoop also requires SSH (Secure Shell) so that processes on the different
hosts can communicate without needing a password. We use the following command
$ ssh-keygen
to initiate this process and store a new public key in the list of authorized keys. We also configure
the necessary environment variables for Hadoop, the directory where data files are stored, and the
network ports it listens on by editing the respective script and XML files on the master and slave
machines. We also format the Hadoop file system via the namenode.
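For illustration, the master is usually pointed to from conf/core-site.xml and conf/mapred-site.xml
on every node; the host name and port numbers below are examples only, not our actual cluster
addresses:

<!-- conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>   <!-- NameNode address -->
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>          <!-- JobTracker address -->
</property>

The file system is then formatted once from the master with
$ bin/hadoop namenode -format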
Then we start our multi-node cluster by executing the command
$ start-all.sh
This will start the HDFS daemons: the NameNode is started on the master and DataNode daemons
are started on all slaves. It will also start the MapReduce daemons: the JobTracker is started on
the master and TaskTrackers are started on all slaves. Executing the following command will stop
all daemons.
$ stop-all.sh
Before we run any MapReduce job on our cluster we must first copy the files from our local file
system into Hadoop's HDFS. If this is the first time a MapReduce job is executed on this namenode,
the namenode must be formatted first, as noted above. We stored our log files in a directory named
fifa98trace in HDFS; for the sake of clarity, here is the command to be executed:
$ bin/hadoop dfs -copyFromLocal /home/hadoop/fifa98trace /user/hadoop/fifa98trace
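Once the input is in HDFS, each job is configured by a small driver class and submitted with the
hadoop jar command. The sketch below shows one possible driver for the hit-counting mapper and
reducer sketched in Section 2; the jar name, class names and output path are illustrative, not the
exact ones we used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the mapper and reducer together and points the job at HDFS paths.
public class HitCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "fifa98 hit count");
        job.setJarByClass(HitCountDriver.class);
        job.setMapperClass(HitCountMapper.class);
        job.setReducerClass(HitCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/hadoop/fifa98trace
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/hadoop/output-mr1
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

$ bin/hadoop jar urlpattern.jar HitCountDriver /user/hadoop/fifa98trace /user/hadoop/output-mr1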
6 Experiment Results
Before we started our map and reduce jobs, we prepared our test bed and test data set.
• We chose a cluster of 6 machines, each configured with an Intel Core 2 Duo @ 2.40 GHz x 2,
approximately 80 GB of permanent storage, and Ubuntu 12.04 LTS (32-bit) installed.
• We wrote 4 different MapReduce jobs, namely MR-1, MR-2, MR-3 and MR-4. We randomly chose
data from 15 days, totalling about 12 GB in size.
• The mapper in MR-1 outputs each unique client ID (not IP address) and its occurrences. The
reducer groups and merges them to output the top 5 client IDs with the greatest number of hits;
a sketch of one way such a top-N selection could be coded is given after Table 4.
The result of MR-1, the 5 most frequent visitors, is shown in Table 1. The total time required
to complete the job was approximately 20 min 2 sec.
ClientID    Hit count
298294 228457
514370 151637
3533 148412
9054 147910
129 137905
Table 1: Result of MapReduce Job -1
• MR-2 outputs the peak hour, or more precisely the interval during which the website got the
most hits over the whole tournament duration. The total time to output the result was approximately
22 min 21 sec. The result is shown in Table 2.
On 30 April, 1998, the interval 21:00:00 - 22:00:00 got the most hits: 324
Table 2: Result of MapReduce Job -2
• MR-3 shows the type of content or file requested from the website over the tournament, i.e.
whether image files or text/html files were requested most. The total time required to complete
the job was approximately 18 min 23 sec. The result is shown in Table 3.
File type requested    No. of times
image 113695131
text/html 17106837
Table 3: Result of MapReduce Job -3
• MR-4 outputs a list of every day with its corresponding total hit count, from which we can deduce
that days with the most hits indicate a special event occurring that day. The total time taken was
approximately 22 min 8 sec. The result is shown in Table 4.
Date Hit Count
30 Apr,1998 98759
01 May,1998 1158019
02 May,1998 692603
05 May,1998 103376
06 May,1998 1584440
08 May,1998 109996
09 May,1998 905842
17 May,1998 112111
18 May,1998 1434082
25 May,1998 244558
26 May,1998 3894595
02 Jun,1998 517042
03 Jun,1998 7934114
04 Jun,1998 8292042
05 Jun,1998 8048127
06 Jun,1998 5337589
07 Jun,1998 5848249
08 Jun,1998 15138598
09 Jun,1998 20527120
10 Jun,1998 48820687
Table 4: Result of MapReduce Job -4
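The grouping and top-5 selection of MR-1, referred to above, can be coded in several ways. One
simple possibility, paired with a hit-counting mapper like the one sketched in Section 2 and run with
a single reducer, is to keep a bounded sorted map of running totals and emit it when the reducer
finishes. The sketch below is illustrative only and is not the exact project code.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer for a top-N query: totals the hits per client ID and keeps only the N largest.
// Ties on the total overwrite each other in this simple sketch; real code would handle them.
public class TopClientsReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private static final int N = 5;
    // Maps running total -> client ID; the TreeMap keeps entries ordered by total.
    private final TreeMap<Long, String> top = new TreeMap<Long, String>();

    @Override
    protected void reduce(Text clientId, Iterable<LongWritable> counts, Context context) {
        long total = 0;
        for (LongWritable c : counts) {
            total += c.get();
        }
        top.put(total, clientId.toString());
        if (top.size() > N) {
            top.remove(top.firstKey()); // drop the current smallest total
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the surviving N entries, largest total first.
        for (Map.Entry<Long, String> e : top.descendingMap().entrySet()) {
            context.write(new Text(e.getValue()), new LongWritable(e.getKey()));
        }
    }
}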
7 Conclusion and Future Work
The MapReduce framework is used here to analyze log files, which constitute big data. The
MapReduce functions are implemented on top of Hadoop. We wrote specific map and reduce jobs
to answer each of our queries: distinct client counts, the time of the most frequent hits, and the
types of content in the most frequent hits. The map and reduce jobs we implemented are capable
of running both on a single machine and on a cluster of machines. Future work could extend this
by making the reduce stage as parallel as the map stage.
References
[1] Mirko Calvaresi. Building a distributed search system with Apache Hadoop and Lucene. In Proc.
of ICCV, 2009.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters.
In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[3] Dino Keco and Abdulhamit Subasi. Parallelization of genetic algorithms using Hadoop
Map/Reduce. Southeast Europe Journal of Soft Computing, 2012.
[4] Sayalee Narkhede and Tripti Baraskar. HMR log analyzer: Analyze web application logs over
Hadoop MapReduce. International Journal of UbiComp, July 2013.
[5] P. Srinivasa Rao, K. Thammi Reddy, and M. H. M. Krishna Prasad. A novel approach for
identification of Hadoop cloud temporal patterns using MapReduce. I.J. Information Technology
and Computer Science, 2014.
