BIG Data
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both
structured and unstructured data that is so large it is difficult to process using traditional
database and software techniques. In most enterprise scenarios the data is too big, moves
too fast, or exceeds current processing capacity. Despite these problems,
big data has the potential to help companies improve operations and make faster, more
intelligent decisions.
Big Data: Volume or a Technology?
While the term may seem to reference the volume of data, that isn't always the case. The
term big data, especially when used by vendors, may refer to the technology (which
includes tools and processes) that an organization requires to handle the large amounts of
data and storage facilities. The term big data is believed to have originated with Web search
companies that needed to query very large, distributed aggregations of loosely structured
data.
An Example of Big Data
An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes)
of data consisting of billions to trillions of records of millions of people—all from different
sources (e.g. Web, sales, customer contact centers, social media, mobile data and so on). Such
data is typically loosely structured, often incomplete, and not readily accessible.
Byte of Data : One grain of rice
Kilobyte : Cup of rice
Megabyte : 8 bags of rice
Gigabyte : 3 container lorries
Terabyte : 2 container ships
Petabyte : Covers Manhattan
Exabyte : Covers the UK 3 times
Zettabyte : Fills the Pacific Ocean
For more insights:
http://www.webopedia.com/TERM/B/big_data_analytics.html
http://blog.yantrajaal.com/2015/04/hadoop-hortonworks-h2o-machine-learning.html
http://www.datamation.com/applications/big-data-analytics-overview.html
My first program in Hadoop
Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware. It provides massive storage for any kind of data and
enormous processing power. The minimum requirement for running the Hadoop sandbox used
here is a machine with at least 6 GB of RAM.
Step 1: Installing Oracle VirtualBox and Hadoop
Oracle VirtualBox is needed to run the virtual machine that hosts the Hadoop sandbox.
The Hadoop sandbox setup can be downloaded from the following link:
http://hortonworks.com/hdp/downloads/
Hortonworks is a company that focuses on the development and support of Apache
Hadoop, a framework that allows for the distributed processing of large data sets across
clusters of computers.
For more info: http://hortonworks.com/hadoop/
After a successful installation of Hadoop, the Oracle VirtualBox screen will look as below:
Step 2: Starting Hadoop
Once the installation is complete, open the Hortonworks platform. A new window will open
like the one shown below.
After Hadoop is running, the Hortonworks Sandbox session should be accessed by typing its
link into a new browser tab, as in the screenshot below:
Step 3: Logging into the sandbox
Login: root
Password: Hadoop
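The sandbox shell is reached over SSH. On a default local VirtualBox install, SSH is typically forwarded to port 2222 on the host (an assumption; check the port mappings shown on the sandbox console):
# assumes the sandbox forwards SSH to localhost:2222
ssh [email protected] -p 2222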
Step 4: Creating the directory
A working directory should be created in the SSH shell with the following command:
mkdir WCclasses
Step 5: Running the given programs required to do the specified tasks
The files needed for the exercise can be downloaded from the following link:
https://www.dropbox.com/s/s1pirjdqr8wf4jy/JavaWordCount.zip?dl=0
Run the following three programs (a sketch of their typical contents follows this list):
SumReducer.java
WordCount.java
WordMapper.java
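The zip is not reproduced here, but these three files typically follow the standard Hadoop WordCount structure. A minimal sketch of that structure is below; this is an assumption about the zip's contents, shown as a single listing, whereas in the zip each class lives in its own .java file and is declared public.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// WordMapper.java: emits (word, 1) for every token in each input line
class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

// SumReducer.java: sums the per-word counts emitted by the mapper
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

// WordCount.java: the driver that wires the mapper and reducer together
public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/hue/wc-inp
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/hue/wc-out2
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}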
Step 6: Uploading the Java programs to the sandbox shell
All of the above files are loaded into the sandbox over SSH using the following commands:
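The upload commands themselves appear only in the original screenshots. One plausible reconstruction (a sketch, assuming the source files are on the host machine and SSH is forwarded to port 2222 as above) is to copy them in with scp:
# copies the three source files from the host into the sandbox home directory
scp -P 2222 SumReducer.java WordCount.java WordMapper.java [email protected]:~/
Alternatively, each file can be created directly in the sandbox with vi, as described next.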
Each Java program is entered and saved one by one from the home directory by typing
vi <program name>.java
The saved programs then reside under that directory.
Step 7: Shell commands for compiling and packaging the Java programs
The Java programs saved under the home directory are compiled into the WCclasses
directory and then packaged into a jar named WordCount.jar, as below:
From the above screenshot we can see all three programs have been compiled and saved as
SumReducer.class, WordCount.class and WordMapper.class respectively. The "(deflated ...%)"
lines in the screenshot are the jar tool reporting the compression of each class file as it is
added to the jar.
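The compile-and-package commands are visible only in the original screenshots. A hedged reconstruction, assuming the hadoop client is on the sandbox PATH (so that `hadoop classpath` resolves the required jars):
# compile the three sources, writing the .class files into WCclasses
javac -classpath `hadoop classpath` -d WCclasses SumReducer.java WordCount.java WordMapper.java
# bundle the class files into WordCount.jar
jar cvf WordCount.jar -C WCclasses .
The -d WCclasses flag writes the .class files into the directory created in Step 4, and jar cvf bundles them into WordCount.jar.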
Step 8: Running the compiled program against the Hadoop libraries in the HDP distribution
The program compiled above can be executed in the HDP distribution with the following
commands:
hdfs dfs -ls /user/hue
hdfs dfs -ls /user/hue/wc-inp
hdfs dfs -rm -r /user/hue/wc-out2
The jar file saved for word counting is then run against the Hadoop libraries with the
following command (as in the screenshot above):
hadoop jar WordCount.jar WordCount /user/hue/wc-inp /user/hue/wc-out2
Step 9: Uploading the text files into the directory using Hue
After step 8, one logs into Hue with the URL, username and password given for Hue, as in
the screenshot below:
Step 10: Modifying the shell scripts to point to the correct input and output directories
After logging into Hue, the shell scripts have to be pointed at the correct input and output
directories. This is done using the File Browser tab in Hue, as in the screenshot below:
Step 11: Compilation and execution of the Java programs
wc-out2 is the output directory; the executed WordCount program writes its results into
this directory, as in the screenshot below:
Step 12: WordCount program output
The word-count output produced after compiling the three Java programs can be seen under
the Job Browser tab of Hue (logged in as root). There we can verify that the compiled job
submitted from WordCount.jar has succeeded, as in the screenshot below:
The word count produced by the three Java programs compiled in the steps above is stored
in Hadoop at the wc-out2/part-r-00000 file path, as in the screenshot below:
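Equivalently, the same output can be inspected from the SSH shell (assuming the paths used in the earlier steps):
# print the first few lines of the job's output file
hdfs dfs -cat /user/hue/wc-out2/part-r-00000 | head
Each line of part-r-00000 holds a word and its count, tab-separated.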
We can see from the screenshots the word counts produced by the Java programs
SumReducer.java, WordMapper.java and WordCount.java.
