Big Data Project Presentation
Team Members: Shrinivasaragav Balasubramanian, Shelley
Bhatnagar
STACK OVERFLOW DATASET ANALYSIS
 The Dataset is obtained from the Stack Exchange Data Dump hosted at the Internet
Archive.
 The link to the Dataset is as follows:
https://archive.org/details/stackexchange
 Each site under Stack Exchange is packaged as a separate 7-zip archive
containing several XML files.
 We chose the Stack Overflow segment of the Stack Exchange Dump, which is
originally around 20 GB; we reduced it to 3 GB for performing the analysis.
Dataset Overview:
 The Stack Overflow Dataset consists of the following files, which are treated as
tables in our Database Design:
 Posts
 PostLinks
 Tags
 Users
 Votes
 Badges
 Comments
Dataset Overview:
 Since our dataset is in XML format, we designed a parser for each file, i.e.
table, to process the data easily and load it into HDFS.
 The parsers were implemented as a Java application, with a Mapper
and Reducer configured as a Hadoop job to parse the data.
 The JAR is run in Hadoop distributed mode and the parsed data is written
to HDFS.
 Each file in the dataset consists of 12+ million entries.
 Each table had 6-7 attributes on average, and also contained missing
attributes and empty fields, i.e. inconsistent data entries, which the
parser handled.
Mission:
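The row-parsing logic described above can be sketched outside Hadoop; this is a minimal Python sketch, assuming each data line of a dump file is a self-closing `<row .../>` element (the function name and sample line are illustrative, not the project's actual Java code):

```python
import xml.etree.ElementTree as ET

def parse_row(line):
    """Parse one <row .../> line of a Stack Exchange dump file into a dict.

    Missing attributes simply do not appear in the dict, which lets the
    caller substitute defaults for inconsistent entries.
    """
    line = line.strip()
    if not line.startswith("<row"):
        return None  # skip the XML declaration and the enclosing root tags
    return dict(ET.fromstring(line).attrib)

row = parse_row('<row Id="1258222" PostTypeId="2" ParentId="1238775" Score="1" />')
```

In the actual pipeline this per-line logic sits inside the Mapper, with the resulting fields emitted as delimited records into HDFS.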
 The Posts table contains an attribute named PostTypeId, which is 1 if
the Post is a Question and 2 if the Post is an Answer to a Question.
 Since most of our analysis was centered on this table, we split the
table into PostQuestions and PostAnswers to simplify the analysis.
 E.g. <row Id="1258222" PostTypeId="2" ParentId="1238775"
CreationDate="2009-08-11T02:29:20.380" Score="1"
Body="&lt;p&gt;Lisp. There are so many Lisp systems out there defined in
terms of rules not imperative commands. Google ahoy...&lt;/p&gt;&#xA;"
OwnerUserId="16709" LastActivityDate="2009-08-11T02:29:20.380"
CommentCount="0" />
Posts Table:
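The PostTypeId split above can be sketched as follows (a Python sketch; the sample rows are hypothetical):

```python
def split_posts(rows):
    """Split parsed Posts rows into question and answer tables by PostTypeId."""
    questions = [r for r in rows if r.get("PostTypeId") == "1"]
    answers = [r for r in rows if r.get("PostTypeId") == "2"]
    return questions, answers

rows = [
    {"Id": "1238775", "PostTypeId": "1", "Title": "Rule-based languages?"},
    {"Id": "1258222", "PostTypeId": "2", "ParentId": "1238775"},
]
questions, answers = split_posts(rows)
```

Answers keep their ParentId, so PostQuestions and PostAnswers can still be joined back together when an analysis needs both sides.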
 The trending Questions that are viewed and scored highly by users.
 The Questions that don't have any answers.
 The Questions that have been marked closed, for each category.
 The Questions that are dead, with no activity in the past 2 years.
 The most viewed questions in each category.
 The most scored questions in each category.
 The count of questions posted in each category over a timeframe (say 2
years).
 The list of tags other than standard tags.
 The top posted Questions in each category.
Analysis using Posts
 The RANK of the Post in the dataset.
 Approximate time for a User Post in a category to expect a correct answer
or a working solution.
Analysis on Posts (cont)
 The User profile with maximum views.
 The top users with maximum reputation points.
 The most valuable users in the dataset.
 The number of users that have been awarded badges.
 The count of users creating accounts in a given timeframe (say 6 months).
 Recommending users to contribute an answer for a similarly liked
category.
 The inactive accounts over a range of time.
 The total number of dead accounts.
 The number of users bearing various badges.
Analysis on Users:
 The comments that have a count greater than the average count.
 The users posting the maximum number of comments.
 The Question Posts that have the highest number of comments.
Analysis on Comments
 The number of spam comments in the dataset.
 The Users that contribute to the spam posts.
 The Posts that are scheduled to be deleted from the data dump over a period
(say 6 months).
 The top users receiving votes marked as Favorite.
Analysis on Votes
 A page rank is calculated to find the weight that a user's posted question
contributes to the dump.
 Each Post written as a question may be linked to several other similar posts
that are posted by users having similar doubts.
 Similarly, each answer to a post can be referenced by another post.
 Hence, Page Rank is a “vote” by all the other posts in the dataset.
 A link to a Post counts as a vote of support; its absence indicates a
lack of support.
Overview of Internal Page Rank Analysis:
 Thus, if we have a Post A with Posts T1…Tn pointing
to it, take a damping factor d between 0 and 1, and define C(T) to
be the number of outbound links of Post T; then the Page Rank of a
Post is given as follows:
 PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Page Rank Formula:
 The Page Rank of each Post depends on the Posts linking to it.
 It is calculated without knowing the final Page Rank values.
 Thus we run the calculation repeatedly, with each iteration taking us closer
to the estimated final value.
How is Page Rank Calculated?
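A minimal sketch of this iterative calculation, applying the formula above with the damping factor d = 0.25 used in this project (the link graph and post ids are hypothetical):

```python
def page_rank(links, d=0.25, iterations=50):
    """Iteratively apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) over inbound T.

    `links` maps each post id to the list of post ids it links to, so
    C(T) = len(links[T]). Every rank starts at an arbitrary guess (1.0)
    and is recomputed repeatedly until the values settle.
    """
    posts = set(links) | {t for ts in links.values() for t in ts}
    pr = {p: 1.0 for p in posts}
    for _ in range(iterations):
        new = {}
        for p in posts:
            inbound = (pr[q] / len(links[q]) for q in links if p in links[q])
            new[p] = (1 - d) + d * sum(inbound)
        pr = new
    return pr

# Two posts linking to each other: by symmetry both settle at rank 1.0.
ranks = page_rank({"A": ["B"], "B": ["A"]})
```

With this non-normalized form of the formula, the settled ranks average 1.0 over all posts, matching the observation on the next slide.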
 The “damping factor” is quite subtle:
 if it's too high, it takes ages for the numbers to settle;
 if it's too low, you get repeated over-shoot.
 We performed an analysis to find the optimal damping factor.
 The damping factor chosen for this Dataset is 0.25.
 No matter where we start the guess, once settled, the average Page
Rank of all pages will be 1.0.
Choosing the Damping Factor:
Example
Web Application: Internal Page Rank Analysis
 The analysis predicts and provides an estimated time in which a user can
expect activity on a Post.
 The analysis involved categorizing the dataset according to tags.
 For each posted question, the fastest reply was taken into consideration
and the time difference between posting the question and getting the first
reply was calculated.
 This difference was averaged over all the posts belonging to a category,
thereby predicting the activity on a post.
Predicting First Activity Time On A Post
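The averaging described above can be sketched as follows (a Python sketch; timestamps are simplified to whole seconds and the sample data is hypothetical):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

def first_reply_minutes(question_time, answer_times):
    """Minutes between posting a question and its fastest answer."""
    q = datetime.strptime(question_time, FMT)
    fastest = min(datetime.strptime(t, FMT) for t in answer_times)
    return (fastest - q).total_seconds() / 60

def average_first_reply(per_question):
    """Average the fastest-reply delay over all questions in one category."""
    delays = [first_reply_minutes(q, answers) for q, answers in per_question]
    return sum(delays) / len(delays)

# Two questions in one category: first answered in 30 min, second in 10 min.
avg = average_first_reply([
    ("2009-08-11T02:00:00", ["2009-08-11T02:30:00", "2009-08-11T03:00:00"]),
    ("2009-08-11T05:00:00", ["2009-08-11T05:10:00"]),
])
```

Computing this average once per tag gives the per-category estimate that the application reports back to the user.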
 In the application, a user can provide the tags he/she would be using for
their posts.
 Based on the tags provided, the application calculates the average
time taken for an activity on each tag and then averages these results.
How This Works In The Application
 Creating a graph structure based on Posts and Related Posts.
 The graph comprises Nodes and Edges.
 Each Node has several Edges, and each Edge leads to a Node that again
has several Edges.
 Created a Pig UDF to which all the Posts and Related Posts are sent as a
Group.
 Based on the input, a graph gets created.
 Rank is calculated based on how many incoming links each Node has.
 The more incoming links, the higher the Page Rank.
How We Did It
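The in-link counting done inside the Pig UDF can be sketched outside Pig as follows (post ids are hypothetical):

```python
from collections import defaultdict

def rank_by_inlinks(post_links):
    """Count incoming links per node from (post, related_post) pairs.

    Mirrors the grouping done in the Pig UDF: each pair is an edge in the
    graph, and a node's rank grows with its number of incoming links.
    """
    incoming = defaultdict(int)
    for post, related in post_links:
        incoming[related] += 1
    # Highest in-link count first, i.e. highest rank first.
    return sorted(incoming.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_by_inlinks([("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P4", "P3")])
```

These in-link counts then feed the iterative Page Rank computation described earlier.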
 Integrated Hive with the existing HBase table.
 We need to provide hbase.columns.mapping, whereas
hbase.table.name is optional.
 We use the HBaseStorageHandler to allow Hive to interact with HBase.
Hive HBase Integration
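A typical mapping statement looks like the sketch below; the table and column names are hypothetical, while hbase.columns.mapping, hbase.table.name, and HBaseStorageHandler are the standard Hive-HBase integration names:

```sql
-- Map an existing HBase table of posts into Hive (names illustrative).
-- hbase.columns.mapping is required; hbase.table.name is optional
-- (it defaults to the Hive table name).
CREATE EXTERNAL TABLE posts (id STRING, title STRING, score INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,p:title,p:score")
TBLPROPERTIES ("hbase.table.name" = "posts");
```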
 HiveServer is an optional service that allows a remote client to submit
requests to Hive, using a variety of programming languages, and retrieve
results.
 We used the Hive Thrift Server to connect to the Hive tables from the
Web Application.
 Starting the Hive Thrift Server: hive --service hiveserver
 Connection String:
Hive Thrift Server
 Providing Suggestions to users regarding the various questions they can
answer from other categories.
 We have taken the User ID, Category ID and the Interaction level as the
input to Mahout User Recommender.
Mahout User Based Recommender
 We used Pig queries to join the various tables and produce an output
containing User ID, Category ID and Interaction Level.
 We used this output as the input to the Mahout User Based
Recommender.
 We converted the Interaction Level values to the range 0 to 5.
 We used PearsonCorrelationSimilarity as the similarity and
NearestNUserNeighborhood as the neighborhood.
 We then used the user-based recommender to provide 3 suggestions of
other Categories to which the user can contribute by answering
questions.
How Did We Implement It
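Independently of the Mahout API, the recommender pipeline can be sketched in Python: Pearson similarity over co-rated categories, an N-nearest-neighbor neighborhood, and suggestions drawn from categories the target user has not yet touched (users, categories, and interaction levels below are hypothetical):

```python
def pearson(a, b):
    """Pearson correlation between two users over co-rated categories."""
    common = list(set(a) & set(b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [a[c] for c in common]
    ys = [b[c] for c in common]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def recommend(user, ratings, n_neighbors=2, top=3):
    """Suggest unseen categories from the N most similar users."""
    others = [u for u in ratings if u != user]
    neighbors = sorted(others, reverse=True,
                       key=lambda u: pearson(ratings[user], ratings[u]))[:n_neighbors]
    seen = set(ratings[user])
    scores = {}
    for u in neighbors:
        for cat, level in ratings[u].items():
            if cat not in seen:
                scores[cat] = max(scores.get(cat, 0), level)
    return sorted(scores, key=scores.get, reverse=True)[:top]

ratings = {  # user -> {category: interaction level scaled to 0-5}
    "u1": {"java": 5, "sql": 4},
    "u2": {"java": 5, "sql": 4, "pig": 5},
    "u3": {"java": 1, "sql": 5, "hive": 3},
}
suggestions = recommend("u1", ratings)
```

Mahout's UserBasedRecommender follows the same shape, with the similarity and neighborhood objects supplied as pluggable components.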
Web Application: Mahout Recommender
 We were able to incorporate our analysis into a Web Application.
 The Web Application retrieves the required data using HBase and Hive.
 Attached below are screenshots of the application and the analysis that
has been performed.
 We used Google Charts to display our analysis as graphs.
Web Application
Questions Posted By User: Used HBase
Tag Count Analysis: Most Used Tags
Dead Accounts Analysis
Closed Questions Analysis
Comments To Answers Analysis
Top Questions Analysis
Trending Posts Analysis
Monthly Deleted Posts
Answered Vs Unanswered Questions
Finding Average Answer Time
Internal Page Rank Analysis
Mahout Recommender
 Performance depends on input sizes and the HDFS chunk (block) size used
by MapReduce.
 Where queries required sorting of data, many temporary files
were created and written to disk.
 The performance of MapReduce is evaluated by reviewing the counters for the
map task.
 The parser implemented to read the XML files faced significant
problems.
 The number of spilled records was significantly greater than what the map
task read, which resulted in a NullPointerException with the message:
INFO mapreduce.Job: Job job_local1747290386_0001 failed with
state FAILED due to: NA
Problem Faced:
