Copyright © 2024 Jayanti Rajdevendra Pande. All rights reserved.
RASHTRASANT TUKDOJI MAHARAJ NAGPUR UNIVERSITY
MBA
SEMESTER: 3
SPECIALIZATION
BUSINESS ANALYTICS (BA 2)
SUBJECT
DATA MINING
MODULE NO : 5
WEB MINING & TEXT MINING
- Jayanti R Pande
DGICM College, Nagpur
Q1. What is web mining? Explain its process.
Web mining is the process of using data-mining techniques to automatically extract information from various sources on the
web. It involves discovering insights from web documents and services, encompassing a range of tasks beyond just applying
standard data-mining tools.
PROCESS OF WEB MINING
1. Resource finding: This involves retrieving data from multimedia sources available online or offline, such as news articles, forums, blogs, and HTML documents. It includes extracting text content from HTML documents by removing HTML tags.
2. Information selection and pre-processing: In this step, the original data obtained in the previous subtask undergoes transformations. These may include pre-processing tasks such as removing stop words, stemming, or restructuring the data to achieve the desired representation, for example identifying phrases in the training corpus or converting text into a first-order logic form (a short pre-processing sketch follows the step overview below).
3. Generalization: Generalization is the process of discovering general patterns within individual websites as well as across multiple sites. Various machine-learning techniques, data-mining methods, and specialized web-oriented approaches are used to identify these patterns.
4. Analysis: This final task involves validating and interpreting the patterns mined from the data. It includes assessing the significance and relevance of the discovered patterns in relation to the objectives of the web mining process.
Process overview: 1 Resource finding → 2 Information selection and pre-processing → 3 Generalization → 4 Analysis
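As an illustration of step 2, here is a minimal pre-processing sketch in Python. It is not part of the original module; it assumes the NLTK library is installed (for its English stop-word list and Porter stemmer), and the HTML snippet passed to it is a made-up example.

```python
# Minimal sketch of step 2 (information selection and pre-processing), assuming NLTK.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)   # one-time download of the English stop-word list


def preprocess(html_doc: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", html_doc)       # strip HTML tags (from resource finding)
    tokens = re.findall(r"[a-z]+", text.lower())   # simple lowercase tokenisation
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    # Stop-word removal followed by stemming, as described above.
    return [stemmer.stem(t) for t in tokens if t not in stop_words]


# Hypothetical document:
print(preprocess("<p>Web mining discovers useful patterns from web documents.</p>"))
# e.g. ['web', 'mine', 'discov', 'use', 'pattern', 'web', 'document']
```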
Q2. Compare Web Content, Web Structure and Web Usage.

Web Content Mining
- Involves unstructured data, primarily text documents.
- Analysis typically involves machine learning techniques, including statistical methods like NLP.
- Data representation often includes models like bag of words and n-gram terms.
- Applications include categorization, clustering, and pattern identification within textual data.
- It focuses on extracting meaningful information from unstructured text data on the web.
- Techniques such as sentiment analysis and topic modeling are commonly applied.

Web Structure Mining
- Deals with semi-structured data found in hypertext documents.
- Utilizes proprietary algorithms for analysis.
- Representation is often depicted as edge-labeled graphs.
- Main applications include identifying frequent substructures within web documents and discovering website schemas.
- It deals with the organization and relationships between web elements like pages and links.
- Graph-based algorithms are often used to analyze connectivity and relationships.

Web Usage Mining
- Focuses on interactive aspects of web data, including link structures and server/browser logs.
- Analysis methods encompass machine learning and statistical techniques, especially association rules.
- Data is represented using graphs and relational tables.
- Applications range from categorization and clustering to site construction and rule extraction from user behavior.
- It emphasizes understanding user behavior and interaction patterns on the web.
- Usage patterns are analyzed to improve website design, content delivery, and marketing strategies.
Q3. Explain the working of HITS Algorithm.
HITS, short for Hyperlink-Induced Topic Search, is an algorithm used for ranking web pages based on their authority and hub
scores. It evaluates the importance of a web page by considering both its authority, which is a measure of its relevance to a
specific topic, and its hub score, which indicates its capacity to link to other authoritative pages on the same topic. The HITS algorithm operates by analyzing the link structure of the web and iteratively computing authority and hub scores for web pages.
HITS Algorithm Steps: 1 Root Set Retrieval → 2 Base Set Construction → 3 Authority and Hub Computation → 4 Iteration Process → 5 Score Normalization → 6 Repeat Iterations
The HITS (Hyperlink-Induced Topic Search) algorithm operates in several steps:
1 Root Set Retrieval: Initially, the most relevant pages to the search query are retrieved. This set is termed the root set and is
typically obtained using a text-based search algorithm.
2 Base Set Construction: The root set is expanded by including all pages linked from it and some pages that link to it. This
augmented set forms the base set, ensuring that a substantial number of strong authorities are included. This base set and the
hyperlinks among its pages constitute a focused subgraph.
3 Authority and Hub Computation: Authority and hub values are computed iteratively in a mutually recursive manner. An
authority value is calculated as the sum of the scaled hub values of the pages that point to it, while a hub value is determined as
the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked
pages.
4 Iteration Process:
Authority Update: Each node's authority score is updated to the sum of the hub scores of the nodes pointing to it, so a node gains a high authority score by being linked from pages recognized as good hubs for information.
Hub Update: Each node's hub score is updated to the sum of the authority scores of the nodes it points to, so a node earns a high hub score by linking to nodes considered authorities on the subject.
These two updates are applied one after the other in every iteration.
5 Score Normalization: After each iteration, the hub and authority scores are normalized by dividing each hub score by the
square root of the sum of the squares of all hub scores, and each authority score by the square root of the sum of the squares
of all authority scores. This normalization process ensures that the scores remain comparable across iterations.
6 Repeat Iterations: The iterations continue until convergence, where the scores stabilize or until a predefined stopping criterion
is met.
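A minimal sketch of steps 3 through 6 in Python is given below. It is only an illustration of the mutual update and normalization; the four-page link graph is hypothetical, and a fixed number of iterations stands in for a real convergence test.

```python
import math

# Hypothetical focused subgraph (base set): page -> pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

hubs = {p: 1.0 for p in graph}    # initial hub scores
auths = {p: 1.0 for p in graph}   # initial authority scores

for _ in range(20):               # repeat iterations (step 6)
    # Authority update: sum of hub scores of the pages pointing to each node.
    auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in graph}
    # Hub update: sum of authority scores of the pages each node points to.
    hubs = {p: sum(auths[q] for q in graph[p]) for p in graph}
    # Normalization: divide by the square root of the sum of squares (step 5).
    a_norm = math.sqrt(sum(v * v for v in auths.values()))
    h_norm = math.sqrt(sum(v * v for v in hubs.values()))
    auths = {p: v / a_norm for p, v in auths.items()}
    hubs = {p: v / h_norm for p, v in hubs.items()}

print("authorities:", auths)
print("hubs:", hubs)
```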
Q4. Write about Text Mining.
Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format in order to uncover meaningful patterns and new insights. By applying advanced analytical techniques such as Naïve Bayes, Support Vector Machines (SVM), and deep learning algorithms, organizations can explore and discover hidden relationships within their unstructured text data.
Text data can be organized into three main formats within databases:
1. Structured Data: This data is standardized into a tabular format with numerous rows and columns, making it easier to store
and process for analysis and machine learning algorithms. Structured data can include inputs such as names, addresses, and
phone numbers.
2. Unstructured Data: This data does not have a predefined format. It can include text from sources like social media or product
reviews, as well as rich media formats like video and audio files.
3. Semi-Structured Data: Semi-structured data is a blend between structured and unstructured formats. While it has some
organization, it lacks enough structure to meet the requirements of a relational database. Examples of semi-structured data
include XML, JSON, and HTML files.
Given that roughly 80% of data in the world resides in an unstructured format, text mining is an extremely valuable practice
within organizations. Text mining tools and natural language processing (NLP) techniques, such as information extraction, enable
the transformation of unstructured documents into a structured format for analysis and the generation of high-quality insights.
This, in turn, improves the decision-making of organizations, leading to better business outcomes.
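As a small illustration of how unstructured text can be converted into a structured, tabular representation, the sketch below builds a bag-of-words term-document matrix with scikit-learn (assumed installed, version 1.0 or later for get_feature_names_out); the three example reviews are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical unstructured inputs, e.g. product reviews collected from the web.
docs = [
    "Great phone, excellent battery life",
    "Battery drains fast, poor phone",
    "Excellent camera and great battery",
]

vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)   # sparse term-document count matrix

# The structured result: one row per document, one column per term.
print(vectorizer.get_feature_names_out())
print(matrix.toarray())
```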
Q5. Explain PageRank Algorithm and its implementation steps.
• The PageRank algorithm, developed by Google, is a method used to determine the importance of web pages in search engine
results. Named after Larry Page, one of Google's founders, PageRank measures the significance of a webpage based on the quantity
and quality of links pointing to it.
• Google describes PageRank as a system that assesses a webpage's importance by considering the number and quality of links it
receives from other pages. The underlying assumption is that more important websites are likely to attract more links from other
websites.
• The algorithm generates a probability distribution to represent the likelihood that a random surfer clicking on links will land on any
particular page. It can be applied to collections of documents of any size, assuming an even distribution of importance among all
documents at the start of the computation.
• In the PageRank computation, a series of iterations, or passes through the collection, is required to adjust the approximate PageRank values so that they better reflect the true values. Each pass transfers PageRank from a page to the targets of its outbound links, with the transfer divided evenly among all of those links.
• For example, in a small four-page web where pages A, B, C, and D each start with a PageRank of 0.25, if B, C, and D link only to page A, then on the next iteration each of them transfers its full 0.25 to A, giving A a total of 0.75.
• In another scenario, suppose page B links to pages C and A, page C links only to page A, and page D links to all three other pages. Starting again from 0.25 each, the PageRank transferred to A in the first iteration is PR(B)/2 + PR(C)/1 + PR(D)/3 = 0.125 + 0.25 + 0.083 ≈ 0.458, because each page divides its PageRank evenly among its outbound links. In the general case, the PageRank value of any page u depends on the PageRank values of the pages linking to u, each divided by that page's number of outbound links. The calculation also applies a damping factor, which, much like a tax on the transferred amount, takes a fixed fraction of the PageRank being passed along; this keeps the scores stable and ensures that the iteration converges.
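• In its standard textbook form (not specific to this module), the update can be written as PR(u) = (1 − d)/N + d · Σ PR(v)/L(v), where the sum runs over the pages v that link to u, L(v) is the number of outbound links on page v, N is the total number of pages, and d is the damping factor (commonly 0.85). Some presentations use (1 − d) in place of (1 − d)/N; the resulting ranking is the same.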
The implementation of the PageRank algorithm involves several steps:
1. Data Collection: Gather information about web pages and their links. This typically involves crawling the web to create a web graph, where nodes represent web pages and edges represent the links between them.
2. Initialization: Assign an initial PageRank value to each web page. In the original formulation every page receives the same initial value, typically 1/N for N pages, so that the values form a probability distribution.
3. Iteration: Perform a series of iterations to update the PageRank values. In each iteration, calculate the PageRank of each page from the PageRank values of the pages linking to it.
4. Damping Factor: Apply a damping factor, which represents the probability that a random surfer keeps clicking on links rather than jumping to a random page; it also limits how much rank can be accumulated through long chains of links, discouraging manipulation.
5. Convergence: Repeat the iteration process until the PageRank values converge, that is, until they no longer change significantly between iterations.
6. Normalization: Normalize the PageRank values so that they sum to 1 and can be interpreted as a probability distribution.
7. Implementation Considerations: Use efficient data structures and algorithms to handle large-scale web graphs. This may involve distributed computing techniques and other optimizations to improve performance and scalability.
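A condensed sketch of steps 2 through 6 in Python is shown below. It is an illustration only: the four-page link graph is invented, every page has at least one outbound link (so dangling pages are not handled), and the loop stops on a simple convergence threshold.

```python
# Minimal PageRank sketch: power iteration with damping and a convergence check.
graph = {                 # hypothetical web graph: page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

d = 0.85                               # damping factor
n = len(graph)
rank = {p: 1.0 / n for p in graph}     # initialization: uniform distribution

for _ in range(100):                   # iterate until convergence (or a max count)
    new_rank = {}
    for p in graph:
        # Sum the rank flowing into p from every page that links to it.
        incoming = sum(rank[q] / len(graph[q]) for q in graph if p in graph[q])
        new_rank[p] = (1 - d) / n + d * incoming   # damped update
    if sum(abs(new_rank[p] - rank[p]) for p in graph) < 1e-8:   # convergence test
        rank = new_rank
        break
    rank = new_rank

total = sum(rank.values())
rank = {p: r / total for p, r in rank.items()}   # normalization so ranks sum to 1
print(rank)
```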
Copyright © 2024 Jayanti Rajdevendra Pande.
All rights reserved.
This content may be printed for personal use only. It may not be copied, distributed, or used for any other purpose
without the express written permission of the copyright owner.
This content is protected by copyright law. Any unauthorized use of the content may violate copyright laws and
other applicable laws.
For any further queries contact on email: jayantipande17@gmail.com
