EFFICIENT FOCUSED WEB CRAWLING
APPROACH FOR SEARCH ENGINE
Research Article published in
IJCSMC, Vol. 4, Issue. 5, May 2015, pg.545 – 551
OUTLINE
A. Introduction
B. Focused web crawlers
C. Various existing methods
D. Stepwise proposed method
E. Results
F. Conclusion and future work
G. References
INTRODUCTION OF WEB CRAWLERS
• A Web crawler is a key component of a search engine.
Web crawling is the process of gathering pages from
the Web in order to index them and support a search engine.
• The objective of crawling is to quickly and efficiently gather as
many useful web pages as possible, together with the link
structure that interconnects them.
• Web crawlers are mainly used to create a copy of all the
visited pages for later processing by a search engine, which
indexes the downloaded pages to provide fast searches.
FOCUSED WEB CRAWLERS
A focused crawler is a web crawler that attempts to download only
web pages that are relevant to a predefined topic or set of topics.
It tries to follow the most promising links and ignores
off-topic documents.
VARIOUS EXISTING METHODS
• Breadth-First Crawling: This is the simplest crawling method.
All pages around the starting point are retrieved
before links further away from the start are followed.
• Depth-First Crawling: The crawler follows the first link on the
starting page, then the first link on the page it reaches, and so
on. Once that chain of links has been indexed, it returns to the
second link on the starting page and follows its subsequent
links in the same way.
• Fish Search: The web is crawled by a team of crawlers,
which are viewed as a school of fish. If a fish finds a relevant
page based on the keywords specified in the query, it continues
looking by following more links from that page. If the page is
not relevant, its child links receive a low preferential value.
• Shark Search: A modification of Fish Search that
differs in two ways: a child inherits a discounted value of the
score of its parent, and this score is combined with a value
based on the anchor text that occurs around the link in the
web page.
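The score inheritance described for Shark Search can be sketched as follows. This is a minimal illustration, not the published algorithm in full: the constants DECAY and ANCHOR_WEIGHT, the keyword-overlap relevance measure, and all function names are assumptions chosen for the example.

```python
# Hypothetical parameters (not given in the slides).
DECAY = 0.5          # discount applied to the score a child inherits
ANCHOR_WEIGHT = 0.8  # share of the final score taken from the anchor text


def text_relevance(text: str, keywords: set[str]) -> float:
    """Fraction of topic keywords that occur in the text (a simple stand-in)."""
    words = set(text.lower().split())
    return len(words & keywords) / len(keywords) if keywords else 0.0


def inherited_score(parent_relevance: float, parent_inherited: float) -> float:
    """A child inherits a discounted value of its parent's score."""
    # A relevant parent passes on its own relevance; otherwise it passes on
    # what it inherited itself, so the value decays along off-topic paths.
    source = parent_relevance if parent_relevance > 0 else parent_inherited
    return DECAY * source


def link_score(inherited: float, anchor_context: str, keywords: set[str]) -> float:
    """Combine the inherited score with a score based on the anchor text."""
    anchor = text_relevance(anchor_context, keywords)
    return (1 - ANCHOR_WEIGHT) * inherited + ANCHOR_WEIGHT * anchor


keywords = {"focused", "crawler"}
inh = inherited_score(parent_relevance=1.0, parent_inherited=0.0)
on_topic = link_score(inh, "a focused crawler tutorial", keywords)
off_topic = link_score(inh, "contact us", keywords)
print(on_topic > off_topic)  # the on-topic anchor text ranks higher
```

Both links share the same parent, so the difference in their scores comes only from the anchor text.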
Many other web crawling methods exist, such as the
PageRank algorithm and the HITS algorithm.
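The breadth-first and depth-first strategies listed above differ only in the order in which the frontier of unvisited links is consumed. The sketch below shows this on a small in-memory link graph; the LINKS table and the crawl_order function are hypothetical and stand in for real page downloads.

```python
from collections import deque

# Hypothetical link graph: page -> outgoing links, in page order.
LINKS = {
    "start": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}


def crawl_order(seed: str, breadth_first: bool) -> list[str]:
    """Return the order in which pages are visited from the seed."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # A queue (FIFO) gives breadth-first; a stack (LIFO) gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        links = LINKS.get(page, [])
        # For depth-first, push links reversed so the first link is popped first.
        for link in (links if breadth_first else reversed(links)):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


bfs = crawl_order("start", breadth_first=True)
dfs = crawl_order("start", breadth_first=False)
print(bfs)  # ['start', 'a', 'b', 'a1', 'a2', 'b1']
print(dfs)  # ['start', 'a', 'a1', 'a2', 'b', 'b1']
```

A focused crawler replaces the queue or stack with a priority order based on estimated relevance.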
STEPWISE PROPOSED METHOD
The methodology described here is essentially a web page
analysis method that treats a page as two parts:
Web page = Text Content + Embedded Links
• Synonyms and sub-synonyms of each term are included
when calculating the term frequency.
• The relevance of a page is also assessed by considering the
links of relevant web pages.
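Counting synonyms toward a term's frequency might look like the sketch below. The SYNONYMS table and the term_frequency function are hypothetical; the slides do not say where the synonyms come from, so a small hand-written table is assumed here.

```python
import re
from collections import Counter

# Hypothetical synonym table: topic term -> words counted as that term.
SYNONYMS = {
    "crawler": {"crawler", "spider", "robot", "bot"},
    "search": {"search", "retrieval", "lookup"},
}


def term_frequency(text: str) -> Counter:
    """Count each topic term, treating its synonyms as occurrences of it."""
    # Invert the table once: word -> canonical topic term.
    canonical = {w: term for term, words in SYNONYMS.items() for w in words}
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in canonical:
            counts[canonical[word]] += 1
    return counts


tf = term_frequency("A spider is a bot; the crawler supports search and retrieval.")
print(tf["crawler"], tf["search"])  # 3 2
```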
Step 1-
• Scan the database and get the data:
- Download all web page content from the database.
- Fetch the number of hyperlinks on each web page.
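As a rough sketch of Step 1, the text content and hyperlinks of a stored page can be separated with Python's standard html.parser module. The PageParser class and the sample HTML string are assumptions for illustration; reading pages from the database is left out, and text inside script or style tags is not filtered.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect visible text and hyperlinks (href plus anchor text) from a page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []      # list of (href, anchor_text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())
            if self._href is not None:
                self._anchor.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None


html = '<p>Focused crawling <a href="/fc">focused crawler</a> and <a href="/se">search</a>.</p>'
parser = PageParser()
parser.feed(html)
parser.close()
page_text = " ".join(parser.text_parts)
link_count = len(parser.links)
print(link_count, parser.links)
```

Keeping the anchor text next to each href is convenient later, when links are scored.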
Step 2-
• Weight table construction:
- Calculate the weight of each term from its term frequency (TF)
and document frequency (DF):
$$ W_i = TF \times DF $$
- Normalize each weight by the largest weight in the table, W_max:
$$ W_{i+1} = W_i / W_{max} $$
- Construct the topic weight table from the normalized weights.
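The weight table construction can be sketched as below, following the two formulas as stated (weight = TF * DF, then division by the largest weight). The function name build_weight_table and the sample counts are hypothetical.

```python
def build_weight_table(tf: dict[str, int], df: dict[str, int]) -> dict[str, float]:
    """Weight each term as TF * DF, then divide by the largest weight."""
    raw = {term: tf[term] * df.get(term, 0) for term in tf}
    w_max = max(raw.values(), default=0)
    if w_max == 0:
        # No term has a positive weight, so there is nothing to normalize.
        return {term: 0.0 for term in raw}
    return {term: w / w_max for term, w in raw.items()}


# Hypothetical counts over a small set of seed documents.
tf = {"crawler": 12, "search": 8, "index": 3}  # total occurrences of the term
df = {"crawler": 4, "search": 5, "index": 2}   # documents containing the term
table = build_weight_table(tf, df)
print(table["crawler"], table["index"])  # 1.0 0.125
```

After normalization the highest-weighted term always has weight 1, and every other weight lies between 0 and 1.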
Step 3-
• Calculate the relevance of the page:
- Calculate the topic relevancy of a page with respect to the topic
keywords in the weight table using the equation:
$$ \mathrm{Relevancy}(t, p) = \frac{\sum_k W_{kt} \, W_{kp}}{\sqrt{\sum_k W_{kt}^2 \, \sum_k W_{kp}^2}} $$
Where,
t = topic weight table
p = web page
W_{kt} and W_{kp} are the weights of the k-th common keyword in
weight table t and web page p, respectively.
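The relevancy score has the form of a cosine similarity between the topic weight table and the page's keyword weights. The sketch below assumes the numerator runs over the keywords common to both, and that each sum in the denominator runs over all keywords of its own vector (the usual cosine convention; the slides do not state the ranges). The function name and sample weights are hypothetical.

```python
import math


def relevancy(topic: dict[str, float], page: dict[str, float]) -> float:
    """Cosine similarity between topic weights and page keyword weights."""
    common = topic.keys() & page.keys()
    numerator = sum(topic[k] * page[k] for k in common)
    norm_t = math.sqrt(sum(w * w for w in topic.values()))
    norm_p = math.sqrt(sum(w * w for w in page.values()))
    if norm_t == 0 or norm_p == 0:
        return 0.0  # an empty vector has no similarity to anything
    return numerator / (norm_t * norm_p)


topic = {"crawler": 1.0, "search": 0.8}
page = {"crawler": 0.5, "index": 0.5}
score = relevancy(topic, page)
print(round(score, 3))
```

A page whose weights match the topic table exactly scores 1, and a page sharing no keywords with it scores 0.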
Step 4-
• Link ranking calculation:
Link ranking assigns scores to the unvisited links
extracted from a downloaded page, using information
from that page.
$$ \mathrm{LinkScore}(k) = \alpha + \beta + \gamma + \delta $$
α = the relevancy between the topic keywords and the href
information.
β = the relevancy between the topic keywords and the anchor text.
γ = the relevancy score of the page from which the link was
extracted.
δ = the relevancy between the text surrounding the link and the topic
keywords.
Relevant URLs and their scores are stored in the relevant URL
buffer.
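One way to sketch the link score and the relevant URL buffer is a sum of the four components fed into a priority queue, so that the highest-scoring unvisited link is crawled next. The class RelevantUrlBuffer, the function link_score, and the example URLs and scores are assumptions; the slides do not describe how the buffer is implemented.

```python
import heapq


def link_score(href_rel: float, anchor_rel: float,
               parent_rel: float, context_rel: float) -> float:
    """Sum of the four relevancy components of a link."""
    return href_rel + anchor_rel + parent_rel + context_rel


class RelevantUrlBuffer:
    """Priority queue of unvisited URLs, highest score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url: str, score: float) -> None:
        if url not in self._seen:  # never queue the same URL twice
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self) -> tuple[str, float]:
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


buffer = RelevantUrlBuffer()
buffer.push("http://example.com/about", link_score(0.0, 0.0, 0.25, 0.0))
buffer.push("http://example.com/crawler", link_score(0.5, 1.0, 0.25, 0.25))
best = buffer.pop()
print(best)  # ('http://example.com/crawler', 2.0)
```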
RESULTS
• To evaluate the performance of the algorithm, we use
precision to estimate the efficiency of a focused crawling
strategy. Precision is the ratio of on-topic pages to all downloaded
pages:
Precision rate = relevant pages / total downloaded pages
After applying the proposed steps to the seed URLs and comparing the
results with other focused crawling algorithms, we find that this
method gives higher precision (approximately 60%).
• As the term frequency increases, the weight of the
keyword increases, which raises the relevancy of the web page
and in turn the number of relevant web pages.
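The precision formula is a single division. The counts in the sketch below are purely illustrative and are not measurements from the study.

```python
def precision(relevant_pages: int, downloaded_pages: int) -> float:
    """Share of downloaded pages that are on topic."""
    if downloaded_pages == 0:
        return 0.0  # nothing downloaded yet
    return relevant_pages / downloaded_pages


# Illustrative counts only: 60 relevant pages out of 100 downloaded.
rate = precision(60, 100)
print(rate)  # 0.6
```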
CONCLUSION AND FUTURE WORK
• We proposed a method for focused web crawling that helps
the crawler reach relevant pages that would otherwise be
missed. With the steps explained in the proposed method,
we get better performance than the existing methods.
• A major open issue for future work is to run extensive tests on a
large volume of web pages.
• Future work also includes code optimization and URL queue
optimization.
REFERENCES
• Qu Cheng, Wang Beizhan, Wei Pianpian, “Efficient Focused Crawling Strategy Using
Combination of Link Structure and Content Similarity”, Software School, Xiamen
University, Xiamen 361005, Fujian, China, Proceedings of 2008 IEEE International
Symposium on IT in Medicine and Education, IEEE, 2008.
• Meenu, Priyanka Singla, Rakesh Batra, “Design of a Focused Crawler Based on
Dynamic Computation of Topic Specific Weight Table”, International Journal of
Engineering Research and General Science, Volume 2, Issue 4, June-July 2014,
ISSN 2091-2730.
• Anshika Pal, Deepak Singh Tomar, S.C. Shrivastava, “Effective Focused Crawling Based
on Content and Link Structure Analysis”, International Journal of Computer
Science and Information Security (IJCSIS), Vol. 2, No. 1, June 2009.
• Bireshwar Gangly, Rahila Sheikh, “A Review of Focused Web Crawling Strategies”,
International Journal of Advanced Computer Research, Volume 2, Number 4, Issue 6,
December 2012.
• Jaira Dubey, Divakar Singh, “A Survey on Web Crawler”, International Journal of
Electrical, Electronic and Computer System, ISSN (Online): 2347-2820, Volume 1,
Issue 1, 2013.
• Meenu, Rakesh Batra, “A Review of Focused Crawler Approaches”, International Journal
of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue
7, July 2014.
THANK YOU
