BIG Data
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both
structured and unstructured data that is so large it is difficult to process using traditional
database and software techniques. In most enterprise scenarios the data is too big, moves
too fast, or exceeds current processing capacity. Despite these problems,
big data has the potential to help companies improve operations and make faster, more
intelligent decisions.
Big Data: Volume or a Technology?
While the term may seem to reference the volume of data, that isn't always the case. The
term big data, especially when used by vendors, may refer to the technology (which
includes tools and processes) that an organization requires to handle the large amounts of
data and storage facilities. The term big data is believed to have originated with Web search
companies that needed to query very large, distributed aggregations of loosely structured
data.
An Example of Big Data
An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes)
of data consisting of billions to trillions of records of millions of people—all from different
sources (e.g. Web, sales, customer contact centers, social media, mobile data and so on). Such
data is typically loosely structured, often incomplete, and not readily accessible.
Byte of Data : One grain of rice
Kilobyte : Cup of rice
Megabyte : 8 bags of rice
Gigabyte : 3 container lorries
Terabyte : 2 container ships
Petabyte : Covers Manhattan
Exabyte : Covers the UK 3 times
Zettabyte : Fills the Pacific Ocean
For more insights:
http://www.webopedia.com/TERM/B/big_data_analytics.html
http://blog.yantrajaal.com/2015/04/hadoop-hortonworks-h2o-machine-learning.html
http://www.datamation.com/applications/big-data-analytics-overview.html
My first program in Hadoop
Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware. It provides massive storage for any kind of data and
enormous processing power. The minimum requirement for running the Hadoop sandbox used
here is a machine with at least 6 GB of RAM.
Step 1: Installing Oracle VirtualBox and Hadoop
Oracle VirtualBox is needed to run the virtual machine that hosts the Hadoop sandbox.
The Hadoop sandbox setup can be downloaded from the following link:
http://hortonworks.com/hdp/downloads/
Hortonworks is a company that focuses on the development and support of Apache
Hadoop, a framework that allows for the distributed processing of large data sets across
clusters of computers.
For more info: http://hortonworks.com/hadoop/
After a successful installation of Hadoop, the Oracle VirtualBox screen will look as below:
Step 2: Starting Hadoop
Once the installation is complete, open the Hortonworks platform. A new window will open
like the one shown below.
After Hadoop is running, the Hortonworks Sandbox session should be accessed by typing its
link into a new browser tab, as in the screenshot below:
Step 3: Logging into the sandbox
Login: root
Password: Hadoop
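The sandbox shell is reached over SSH. On a default local VirtualBox install, SSH is typically forwarded to port 2222 on the host (an assumption; check the port mappings shown on the sandbox console):
# assumes the sandbox forwards SSH to localhost:2222
ssh [email protected] -p 2222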
Step 4: Creating the directory
A working directory should be created in the SSH shell with the following command:
mkdir WCclasses
Step 5: Running the given programs required to do the specified tasks
The files needed for the exercise can be downloaded from the following link:
https://www.dropbox.com/s/s1pirjdqr8wf4jy/JavaWordCount.zip?dl=0
Run the following three programs (a sketch of their typical contents follows this list):
SumReducer.java
WordCount.java
WordMapper.java
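The zip is not reproduced here, but these three files typically follow the standard Hadoop WordCount structure. A minimal sketch of that structure is below; this is an assumption about the zip's contents, shown as a single listing, whereas in the zip each class lives in its own .java file and is declared public.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// WordMapper.java: emits (word, 1) for every token in each input line
class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

// SumReducer.java: sums the per-word counts emitted by the mapper
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

// WordCount.java: the driver that wires the mapper and reducer together
public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/hue/wc-inp
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/hue/wc-out2
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}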
Step 6: Uploading the Java programs to the sandbox shell
All of the above files are loaded into the sandbox over SSH using the following commands:
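The upload commands themselves appear only in the original screenshots. One plausible reconstruction (a sketch, assuming the source files are on the host machine and SSH is forwarded to port 2222 as above) is to copy them in with scp:
# copies the three source files from the host into the sandbox home directory
scp -P 2222 SumReducer.java WordCount.java WordMapper.java [email protected]:~/
Alternatively, each file can be created directly in the sandbox with vi, as described next.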
Each Java program is entered and saved one by one from the home directory by typing
vi <program name>.java
The saved programs then reside under that directory.
Step 7: Shell commands for compiling and packaging the Java programs
The Java programs saved under the home directory are compiled into the WCclasses
directory and then packaged into a jar named WordCount.jar, as below:
From the above screenshot we can see all three programs have been compiled and saved as
SumReducer.class, WordCount.class and WordMapper.class respectively. The "(deflated ...%)"
lines in the screenshot are the jar tool reporting the compression of each class file as it is
added to the jar.
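The compile-and-package commands are visible only in the original screenshots. A hedged reconstruction, assuming the hadoop client is on the sandbox PATH (so that `hadoop classpath` resolves the required jars):
# compile the three sources, writing the .class files into WCclasses
javac -classpath `hadoop classpath` -d WCclasses SumReducer.java WordCount.java WordMapper.java
# bundle the class files into WordCount.jar
jar cvf WordCount.jar -C WCclasses .
The -d WCclasses flag writes the .class files into the directory created in Step 4, and jar cvf bundles them into WordCount.jar.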
Step 8: Running the compiled program against the Hadoop libraries in the HDP distribution
The program compiled above can be executed in the HDP distribution with the following
commands:
hdfs dfs -ls /user/hue
hdfs dfs -ls /user/hue/wc-inp
hdfs dfs -rm -r /user/hue/wc-out2
The jar file saved for word counting is then run against the Hadoop libraries with the
following command (as in the screenshot above):
hadoop jar WordCount.jar WordCount /user/hue/wc-inp /user/hue/wc-out2
Step 9: Uploading the text files into the directory using Hue
After step 8, one logs into Hue with the URL, username and password given for Hue, as in
the screenshot below:
Step 10: Modifying the shell scripts to point to the correct input and output directories
After logging into Hue, the shell scripts have to be pointed at the correct input and output
directories. This is done using the File Browser tab in Hue, as in the screenshot below:
Step 11: Compilation and execution of the Java programs
wc-out2 is the output directory; the executed WordCount program writes its results into
this directory, as in the screenshot below:
Step 12: WordCount program output
The word-count output produced after compiling the three Java programs can be seen under
the Job Browser tab of Hue (logged in as root). There we can verify that the compiled job
submitted from WordCount.jar has succeeded, as in the screenshot below:
The word count produced by the three Java programs compiled in the steps above is stored
in Hadoop at the wc-out2/part-r-00000 file path, as in the screenshot below:
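Equivalently, the same output can be inspected from the SSH shell (assuming the paths used in the earlier steps):
# print the first few lines of the job's output file
hdfs dfs -cat /user/hue/wc-out2/part-r-00000 | head
Each line of part-r-00000 holds a word and its count, tab-separated.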
We can see from the screenshots the word counts produced by the Java programs
SumReducer.java, WordMapper.java and WordCount.java.
