Aadhaar dataset analysis using big data Hadoop
● Name: Abhishek Verma
● Submitted to: Eckovation
● Course: Summer internship program in computer science and IT
TECHNOLOGIES USED
● Cloudera virtual machine running CentOS inside VirtualBox.
● HDFS (Hadoop Distributed File System).
● Linux shell terminal.
● Apache Hive.
3
Procedure or steps
taken.
●
Using hadoop HDFS to transfer
the ‘.csv’ file from local file
system into the hadoop HDFS.
●
Entering hive shell and creating
table.
●
Transferring the data in HDFS to
hive.
●
Performing data analysis on the
data inside the table using hive
querries.
Using Hadoop to transfer a file from the local file system into HDFS
● The file adhar.csv (CSV stands for comma-separated values) is downloaded from the UIDAI website.
● Commands are run in the terminal of the Cloudera machine.
● The command for copying a file from the local file system into HDFS is hadoop fs -copyFromLocal "path of the file" "HDFS destination". In our case the full command is as follows.
● hadoop fs -copyFromLocal /home/cloudera/Desktop/adhar.csv /user/cloudera/
Entering the Hive shell and creating a table
● The command to enter the Hive shell is "hive" (without quotes).
● Once in the Hive shell, a database is required to work in. There is a default database, but it is recommended to create a new database for each new project.
● The command to create a new database is CREATE DATABASE "database name"; which in our case is CREATE DATABASE project3;
● Enter/use the database with the command: USE project3;
● Create the table inside Hive with a type for each column, using the command: CREATE TABLE adhar_dat3 (Registrar STRING, Enrolment_Agency STRING, State STRING, District STRING, Sub_District STRING, Pin_Code STRING, Gender STRING, Age STRING, Aadhaar_generated INT, Enrolment_Rejected INT, Residents_providing_email INT, Residents_providing_mobile_number INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
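For readers who want to follow along without a Hadoop cluster, the same table definition can be sketched in SQLite via Python's built-in sqlite3 module (SQLite and its TEXT/INTEGER type names are substitutions for illustration; the column list follows the slide):

```python
import sqlite3

# In-memory SQLite database standing in for the Hive database "project3".
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE adhar_dat3 (
        Registrar TEXT, Enrolment_Agency TEXT, State TEXT,
        District TEXT, Sub_District TEXT, Pin_Code TEXT,
        Gender TEXT, Age TEXT,
        Aadhaar_generated INTEGER, Enrolment_Rejected INTEGER,
        Residents_providing_email INTEGER,
        Residents_providing_mobile_number INTEGER
    )
""")
# Confirm the table exists, much as SHOW TABLES; would in Hive.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['adhar_dat3']
```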
Transferring data from HDFS to the Hive table
● Once the table is defined, the data can be loaded from either the local file system or HDFS; in our case HDFS is used.
● The command for loading data from HDFS into the Hive table is: LOAD DATA INPATH '/user/cloudera/adhar.csv' INTO TABLE adhar_dat3;
● Here '/user/cloudera/adhar.csv' is the file location in HDFS and adhar_dat3 is the name of the Hive table defined earlier.
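Conceptually, LOAD DATA INPATH makes the CSV rows queryable through the table. A minimal SQLite sketch of the same effect, using a few made-up rows in place of adhar.csv and only a subset of the columns:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adhar_dat3 "
             "(State TEXT, Gender TEXT, Age TEXT, Aadhaar_generated INTEGER)")

# Illustrative rows only, standing in for the real adhar.csv contents.
sample = "Delhi,M,25,1\nGoa,F,30,0\n"
rows = list(csv.reader(io.StringIO(sample)))
conn.executemany("INSERT INTO adhar_dat3 VALUES (?, ?, ?, ?)", rows)

loaded = conn.execute("SELECT count(*) FROM adhar_dat3").fetchone()[0]
print(loaded)  # 2
```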
Performing analysis on the data using queries
● Many different queries can be run on the data according to the user's needs. The queries executed in this case are as follows.
● The number of Aadhaar cards generated by each state.
● The total number of Aadhaar cards with gender as the distinguishing factor.
● The average age of an Aadhaar applicant from each state of the country.
● The enrolment agencies that rejected at least one Aadhaar application, along with the number of applications rejected by each agency.
● The maximum age of an applicant from each state whose enrolment was accepted.
● The number of Aadhaar cards generated by each district.
To find the number of Aadhaar cards generated by each state:
select State, count(Aadhaar_generated) AS cnt from adhar_dat3 group by State;
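The query can be tried on toy data in SQLite, whose GROUP BY semantics match Hive's here. One point worth noting: count(Aadhaar_generated) counts enrolment records per state; sum(Aadhaar_generated) would be needed to total the cards actually generated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adhar_dat3 (State TEXT, Aadhaar_generated INTEGER)")
# Illustrative rows only, not real UIDAI figures.
conn.executemany("INSERT INTO adhar_dat3 VALUES (?, ?)",
                 [("Delhi", 3), ("Delhi", 2), ("Goa", 1)])

# Same query as on the slide, with ORDER BY added for a stable result.
rows = conn.execute(
    "SELECT State, count(Aadhaar_generated) AS cnt "
    "FROM adhar_dat3 GROUP BY State ORDER BY State").fetchall()
print(rows)  # [('Delhi', 2), ('Goa', 1)]
```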
Number of total Aadhaar cards with gender as the distinguishing factor:
select Gender, count(Aadhaar_generated) AS cnt from adhar_dat3 group by Gender;
Average age of an Aadhaar applicant from each state of the country:
SELECT State, round(avg(Age), 1) AS r1 FROM adhar_dat3 GROUP BY State ORDER BY r1;
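Because Age was declared STRING in the table definition, avg() relies on an implicit string-to-number conversion, which both Hive and SQLite perform for numeric-looking strings. A toy run of the same query in SQLite, with illustrative ages:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adhar_dat3 (State TEXT, Age TEXT)")
# Age is stored as text, mirroring the STRING column in the Hive DDL.
conn.executemany("INSERT INTO adhar_dat3 VALUES (?, ?)",
                 [("Goa", "20"), ("Goa", "31"), ("Delhi", "40")])

rows = conn.execute(
    "SELECT State, round(avg(Age), 1) AS r1 "
    "FROM adhar_dat3 GROUP BY State ORDER BY r1").fetchall()
print(rows)  # [('Goa', 25.5), ('Delhi', 40.0)]
```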
Enrolment agencies that rejected at least one Aadhaar application, with the number of applications rejected by each agency:
select Enrolment_Agency, count(Enrolment_Rejected) from adhar_dat3 where (Enrolment_Rejected = 1) group by Enrolment_Agency;
Maximum age of an applicant from each state whose enrolment was accepted:
select State, max(Age) AS cnt from adhar_dat3 where (Enrolment_Rejected = 0) group by State;
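A toy run in SQLite shows the WHERE clause excluding rejected enrolments before grouping. One caveat worth flagging: with Age stored as a string, max() compares lexicographically in both Hive and SQLite ('9' sorts above '10'), so casting Age to an integer would be safer on real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adhar_dat3 "
             "(State TEXT, Age TEXT, Enrolment_Rejected INTEGER)")
# Illustrative rows only; Goa's oldest applicant was rejected.
conn.executemany("INSERT INTO adhar_dat3 VALUES (?, ?, ?)",
                 [("Goa", "72", 0), ("Goa", "80", 1), ("Delhi", "55", 0)])

# Rejected rows are filtered out before max() is taken per state.
rows = conn.execute(
    "SELECT State, max(Age) AS cnt FROM adhar_dat3 "
    "WHERE Enrolment_Rejected = 0 GROUP BY State ORDER BY State").fetchall()
print(rows)  # [('Delhi', '55'), ('Goa', '72')]
```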
To find the number of Aadhaar cards generated by each district:
select District, count(Aadhaar_generated) AS cnt from adhar_dat3 group by District;
Conclusion
● Hadoop makes it possible to analyze data that would otherwise be impractical to analyze because of its huge size.
● MapReduce scripts are applied to the data in HDFS to extract the required information from huge data sets or web logs.
● Unlike classic MapReduce jobs, which are written in Java and must be packaged as a JAR file to work with the data, tools such as Hive and Pig are easier to use because of their similarity to SQL.
● Spark and Impala are emerging technologies that may well replace Hadoop MapReduce, since MapReduce does not offer real-time processing and is claimed by Apache Spark to be up to 100 times slower.
