Aadhaar dataset analysis using big data Hadoop
● Name: Abhishek Verma
● Submitted to: Eckovation
● Course: Summer internship program in computer science and IT
TECHNOLOGIES USED
● Cloudera virtual machine running CentOS inside VirtualBox.
● HDFS (Hadoop Distributed File System).
● Linux shell terminal.
● Apache Hive.
3
Procedure or steps
taken.
●
Using hadoop HDFS to transfer
the ‘.csv’ file from local file
system into the hadoop HDFS.
●
Entering hive shell and creating
table.
●
Transferring the data in HDFS to
hive.
●
Performing data analysis on the
data inside the table using hive
querries.
Using Hadoop to transfer a file from the local file system into HDFS
● The file adhar.csv (CSV stands for comma-separated values) is downloaded from the UIDAI website.
● Commands are run in the terminal of the Cloudera machine.
● The command for copying a file from the local file system into HDFS is hadoop fs -copyFromLocal "path of the file" "HDFS destination". In our case the full command is as follows.
● hadoop fs -copyFromLocal /home/cloudera/Desktop/adhar.csv /user/cloudera/
Entering the Hive shell and creating a table
● The command to enter the Hive shell is "hive" (without quotes).
● Once in the Hive shell, a database is required to work in. There is a default database, but it is recommended to create a new database for each new project.
● The command to create a new database is CREATE DATABASE "database name"; which in our case is CREATE DATABASE project3;
● Enter/use the database with the command: USE project3;
● Create the table inside Hive with a type for each column, using the command: CREATE TABLE adhar_dat3 (Registrar STRING, Enrolment_Agency STRING, State STRING, District STRING, Sub_District STRING, Pin_Code STRING, Gender STRING, Age STRING, Aadhaar_generated INT, Enrolment_Rejected INT, Residents_providing_email INT, Residents_providing_mobile_number INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
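For readers who want to follow along without a Hadoop cluster, the same table definition can be sketched in SQLite via Python's built-in sqlite3 module (SQLite and its TEXT/INTEGER type names are substitutions for illustration; the column list follows the slide):

```python
import sqlite3

# In-memory SQLite database standing in for the Hive database "project3".
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE adhar_dat3 (
        Registrar TEXT, Enrolment_Agency TEXT, State TEXT,
        District TEXT, Sub_District TEXT, Pin_Code TEXT,
        Gender TEXT, Age TEXT,
        Aadhaar_generated INTEGER, Enrolment_Rejected INTEGER,
        Residents_providing_email INTEGER,
        Residents_providing_mobile_number INTEGER
    )
""")
# Confirm the table exists, much as SHOW TABLES; would in Hive.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['adhar_dat3']
```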
Transferring data from HDFS to the Hive table
● Once the table is defined, the data can be loaded from either the local file system or HDFS; in our case HDFS is used.
● The command for loading data from HDFS into the Hive table is: LOAD DATA INPATH '/user/cloudera/adhar.csv' INTO TABLE adhar_dat3;
● Here '/user/cloudera/adhar.csv' is the file location in HDFS and adhar_dat3 is the name of the Hive table defined earlier.
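Conceptually, LOAD DATA INPATH makes the CSV rows queryable through the table. A minimal SQLite sketch of the same effect, using a few made-up rows in place of adhar.csv and only a subset of the columns:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adhar_dat3 "
             "(State TEXT, Gender TEXT, Age TEXT, Aadhaar_generated INTEGER)")

# Illustrative rows only, standing in for the real adhar.csv contents.
sample = "Delhi,M,25,1\nGoa,F,30,0\n"
rows = list(csv.reader(io.StringIO(sample)))
conn.executemany("INSERT INTO adhar_dat3 VALUES (?, ?, ?, ?)", rows)

loaded = conn.execute("SELECT count(*) FROM adhar_dat3").fetchone()[0]
print(loaded)  # 2
```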
Performing analysis on the data using queries
● Many different queries can be run on the data according to the user's needs. The queries executed in this case are as follows.
● The number of Aadhaar cards generated by each state.
● The total number of Aadhaar cards with gender as the distinguishing factor.
● The average age of an Aadhaar applicant from each state of the country.
● The enrolment agencies that rejected at least one Aadhaar application, along with the number of applications rejected by each agency.
● The maximum age of an applicant from each state whose enrolment was accepted.
● The number of Aadhaar cards generated by each district.
To find the number of Aadhaar cards generated by each state:
select State, count(Aadhaar_generated) AS cnt from adhar_dat3 group by State;
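The query can be tried on toy data in SQLite, whose GROUP BY semantics match Hive's here. One point worth noting: count(Aadhaar_generated) counts enrolment records per state; sum(Aadhaar_generated) would be needed to total the cards actually generated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adhar_dat3 (State TEXT, Aadhaar_generated INTEGER)")
# Illustrative rows only, not real UIDAI figures.
conn.executemany("INSERT INTO adhar_dat3 VALUES (?, ?)",
                 [("Delhi", 3), ("Delhi", 2), ("Goa", 1)])

# Same query as on the slide, with ORDER BY added for a stable result.
rows = conn.execute(
    "SELECT State, count(Aadhaar_generated) AS cnt "
    "FROM adhar_dat3 GROUP BY State ORDER BY State").fetchall()
print(rows)  # [('Delhi', 2), ('Goa', 1)]
```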
Number of total Aadhaar cards with gender as the distinguishing factor:
select Gender, count(Aadhaar_generated) AS cnt from adhar_dat3 group by Gender;
Average age of an Aadhaar applicant from each state of the country:
SELECT State, round(avg(Age), 1) AS r1 FROM adhar_dat3 GROUP BY State ORDER BY r1;
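Because Age was declared STRING in the table definition, avg() relies on an implicit string-to-number conversion, which both Hive and SQLite perform for numeric-looking strings. A toy run of the same query in SQLite, with illustrative ages:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adhar_dat3 (State TEXT, Age TEXT)")
# Age is stored as text, mirroring the STRING column in the Hive DDL.
conn.executemany("INSERT INTO adhar_dat3 VALUES (?, ?)",
                 [("Goa", "20"), ("Goa", "31"), ("Delhi", "40")])

rows = conn.execute(
    "SELECT State, round(avg(Age), 1) AS r1 "
    "FROM adhar_dat3 GROUP BY State ORDER BY r1").fetchall()
print(rows)  # [('Goa', 25.5), ('Delhi', 40.0)]
```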
Enrolment agencies that rejected at least one Aadhaar application, with the number of applications rejected by each agency:
select Enrolment_Agency, count(Enrolment_Rejected) from adhar_dat3 where (Enrolment_Rejected = 1) group by Enrolment_Agency;
Maximum age of an applicant from each state whose enrolment was accepted:
select State, max(Age) AS cnt from adhar_dat3 where (Enrolment_Rejected = 0) group by State;
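A toy run in SQLite shows the WHERE clause excluding rejected enrolments before grouping. One caveat worth flagging: with Age stored as a string, max() compares lexicographically in both Hive and SQLite ('9' sorts above '10'), so casting Age to an integer would be safer on real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adhar_dat3 "
             "(State TEXT, Age TEXT, Enrolment_Rejected INTEGER)")
# Illustrative rows only; Goa's oldest applicant was rejected.
conn.executemany("INSERT INTO adhar_dat3 VALUES (?, ?, ?)",
                 [("Goa", "72", 0), ("Goa", "80", 1), ("Delhi", "55", 0)])

# Rejected rows are filtered out before max() is taken per state.
rows = conn.execute(
    "SELECT State, max(Age) AS cnt FROM adhar_dat3 "
    "WHERE Enrolment_Rejected = 0 GROUP BY State ORDER BY State").fetchall()
print(rows)  # [('Delhi', '55'), ('Goa', '72')]
```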
To find the number of Aadhaar cards generated by each district:
select District, count(Aadhaar_generated) AS cnt from adhar_dat3 group by District;
Conclusion
● Hadoop makes it possible to analyze data that would otherwise be impractical to analyze because of its huge size.
● MapReduce scripts are applied to the data in HDFS to extract the required information from huge data sets or web logs.
● Unlike classic MapReduce jobs, which are written in Java and must be packaged as a JAR file to work with the data, tools such as Hive and Pig are easier to use because of their similarity to SQL.
● Spark and Impala are emerging technologies that may well replace Hadoop MapReduce, since MapReduce does not offer real-time processing and is claimed by Apache Spark to be up to 100 times slower.
