Efficient focused web crawling approach

EFFICIENT FOCUSED WEB CRAWLING
APPROACH FOR SEARCH ENGINE
Research Article published in
IJCSMC, Vol. 4, Issue. 5, May 2015, pg.545 – 551

OUTLINE
A. Introduction
B. Focused web crawlers
C. Various existing methods
D. Stepwise proposed method
E. Results
F. Conclusion and future work
G. References

INTRODUCTION OF WEB CRAWLERS
• A Web crawler is a key component inside a search engine.
Web crawling is the process by which we gather pages from
the Web, in order to index them and support a search engine.
• The objective of crawling is to quickly and efficiently gather as
many useful web pages as possible, together with the link
structure that interconnects them.
• Web crawlers are mainly used to create a copy of all the
visited pages for later processing by a search engine that will
index the downloaded pages to provide fast searches.

FOCUSED WEB CRAWLERS
A focused crawler is a web crawler that attempts to download only
web pages that are relevant to a predefined topic or set of topics.
It tries to follow the most promising links and ignores
off-topic documents.

VARIOUS EXISTING METHODS
• Breadth-First Crawling: This is the simplest crawling method.
We retrieve all the pages around the starting
point before following links further away from the start.
• Depth-First Crawling: We follow the first link on the starting
page, then the first link on the page it leads to, and so on.
Once the chain from the first link has been indexed, we return
to the second link of the starting page and follow it and its
subsequent links in the same way.
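
The difference between the two orderings can be illustrated with a minimal Python sketch. The link graph `LINKS` and the function `crawl_order` are hypothetical names that stand in for real page fetching; only the order in which the frontier is consumed differs between the two methods.

```python
from collections import deque

# Hypothetical link graph: page -> outgoing links (stands in for real fetching).
LINKS = {
    "start": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [], "d": [], "e": [],
}

def crawl_order(seed, breadth_first=True):
    """Return the order in which pages are visited starting from the seed."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # A queue (FIFO) gives breadth-first; a stack (LIFO) gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        links = LINKS.get(page, [])
        # For depth-first, push in reverse so the first link is explored first.
        for link in (links if breadth_first else reversed(links)):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl_order("start"))                       # breadth-first order
print(crawl_order("start", breadth_first=False))  # depth-first order
```

On this toy graph the breadth-first crawl visits both neighbours of the start page before any page two links away, while the depth-first crawl exhausts the first link's chain before touching the second link.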

• Fish Search: The web is crawled by a team of crawlers,
which are viewed as a school of fish. If a fish finds a relevant
page based on the keywords specified in the query, it continues
looking by following more links from that page. If the page is
not relevant, its child links receive a low preferential value.
• Shark Search: This is a modification of fish search. It
differs in two ways: a child inherits a discounted value of the
score of its parent, and this score is combined with a value
based on the anchor text that occurs around the link in the
web page.
There are many more methods for web crawling, such as the
PageRank algorithm, the HITS algorithm, etc.
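
The two shark-search ideas described above (a discounted inherited score, combined with an anchor-text score) can be sketched as follows. This is a simplified illustration, not the full shark-search algorithm: the function name `shark_score`, the linear combination, and the default values of `discount` and `mix` are assumptions chosen for the example.

```python
def shark_score(parent_score, anchor_score, discount=0.5, mix=0.8):
    """Score a child link from its parent's score and its anchor text.

    discount and mix are assumed illustrative parameters, both in [0, 1].
    """
    # The child inherits a discounted value of the parent's score.
    inherited = discount * parent_score
    # The inherited value is combined with the anchor-text relevance.
    return mix * inherited + (1 - mix) * anchor_score

# A link found on a relevant page (0.8) with moderately relevant anchor text (0.5).
print(shark_score(0.8, 0.5))
```

Because the inherited part shrinks at every hop, links far from any relevant page gradually lose priority unless their own anchor text is relevant.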

STEPWISE PROPOSED METHOD
The methodology described here is basically a web page
analysis method:
Web page = Text Content + Embedded Links
• We include the synonyms and sub-synonyms of each term
when calculating the term frequency.
• We also assess the relevance of a page by considering the
links of the relevant web page.
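
Counting synonyms toward a term's frequency might look like the sketch below. The synonym table `SYNONYMS` and the function `term_frequency` are hypothetical; the source does not say where the synonyms come from, so a hand-written table is assumed here.

```python
import re
from collections import Counter

# Hypothetical synonym table: topic term -> words counted as that term.
SYNONYMS = {
    "crawler": {"crawler", "spider", "robot"},
    "search": {"search", "retrieval"},
}

def term_frequency(text, synonyms=SYNONYMS):
    """Count each topic term, including occurrences of its synonyms."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {term: sum(words[w] for w in group)
            for term, group in synonyms.items()}

text = "A spider is a crawler; the robot and the crawler search."
print(term_frequency(text))
```

Here "spider" and "robot" both add to the count for "crawler", so a page that uses varied vocabulary for the same topic is not penalised.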

Step 1-
• Scan the database and get data:
- Download all the web page content from the database.
- Fetch the number of hyperlinks in each web page.
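
As a minimal sketch of the link-extraction part of this step, the following uses Python's standard `html.parser` module to collect each hyperlink together with its anchor text. The class `LinkCollector` and the function `extract_links` are hypothetical names, and the page content is assumed to be available as an HTML string.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from the <a> tags of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

page = '<p>See <a href="/a">focused crawler</a> and <a href="/b">search</a>.</p>'
links = extract_links(page)
print(len(links), links)
```

The number of hyperlinks is simply the length of the returned list; the anchor text is kept because the link ranking in Step 4 uses it.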
Step 2-
• Weight Table Construction:
- Calculate the term weight from the term frequency (TF) and
document frequency (DF) using the formula
Wi = TF * DF
- Normalize the weight using the formula
Wi+1 = Wi / Wmax
where Wmax is the largest weight in the table and Wi+1 is the
normalized weight, and construct the topic weight table.
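
The weight table construction can be sketched as below, assuming the term and document frequencies are already available as dictionaries. The function name `build_weight_table` and the sample counts are hypothetical.

```python
def build_weight_table(tf, df):
    """Return normalized weights W = (TF * DF) / Wmax for each term."""
    # Raw weight of each term: term frequency times document frequency.
    raw = {term: tf[term] * df.get(term, 0) for term in tf}
    w_max = max(raw.values(), default=0)
    if w_max == 0:
        # Avoid division by zero when no term has a positive weight.
        return {term: 0.0 for term in raw}
    return {term: w / w_max for term, w in raw.items()}

# Hypothetical counts for three topic terms.
tf = {"crawler": 4, "search": 2, "index": 1}
df = {"crawler": 3, "search": 2, "index": 1}
print(build_weight_table(tf, df))
```

After normalization the heaviest term has weight 1 and all other weights fall between 0 and 1, which keeps tables built from different corpora comparable.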

Step 3 -
• Calculate the relevance of the page:
- Calculate the topic relevancy of the page with respect to the topic
keywords in the table using the equation
$$\mathrm{Relevancy}(t, p) = \frac{\sum_{k} W_{kt}\, W_{kp}}{\sqrt{\sum_{k} W_{kt}^{2} \cdot \sum_{k} W_{kp}^{2}}}$$
Where,
t = topic weight table
p = web page
$W_{kt}$ and $W_{kp}$ are the weights of the k-th common keyword in
weight table t and web page p respectively.
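
The relevancy equation is a cosine-style similarity between the topic weight table and the page. A minimal sketch, assuming both are given as dictionaries mapping keyword to weight (the function name `relevancy` and the sample weights are hypothetical):

```python
import math

def relevancy(topic_weights, page_weights):
    """Cosine-style relevancy over the keywords common to topic and page."""
    common = topic_weights.keys() & page_weights.keys()
    numerator = sum(topic_weights[k] * page_weights[k] for k in common)
    denominator = math.sqrt(
        sum(topic_weights[k] ** 2 for k in common)
        * sum(page_weights[k] ** 2 for k in common)
    )
    # No common keyword (or all-zero weights) means no measurable relevancy.
    return numerator / denominator if denominator else 0.0

topic = {"crawler": 1.0, "search": 0.5}
page = {"crawler": 0.5, "search": 1.0, "news": 0.9}
print(relevancy(topic, page))
```

The sums are restricted to common keywords, following the definition given above. A standard cosine similarity would instead normalize over all keywords of each side, which lowers the score of pages that share only one keyword with the topic.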

Step 4-
• Link Ranking calculation:
Link ranking assigns scores to the unvisited links
extracted from a downloaded page, using information
from the pages.
LinkScore(k) = α + β + γ + ∞
α = the relevancy between the topic keywords and the href
information.
β = the relevancy between the topic keywords and the anchor text.
γ = the relevancy score of the page from which the link was
extracted.
∞ = the relevancy between the text surrounding the link and the
topic keywords.
Relevant URLs and their scores are stored in the relevant URL
buffer.
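
The link score and the relevant URL buffer can be sketched as follows. The buffer is modelled here as a priority queue built on Python's `heapq` module so that the highest-scoring link is crawled next; that choice, the class name `RelevantUrlBuffer`, and the example URLs and component values are assumptions, since the source only says that URLs and scores are stored.

```python
import heapq

def link_score(href_rel, anchor_rel, parent_rel, context_rel):
    """LinkScore = alpha + beta + gamma + surrounding-text relevancy."""
    return href_rel + anchor_rel + parent_rel + context_rel

class RelevantUrlBuffer:
    """Keeps unvisited URLs so the highest-scoring one is popped first."""

    def __init__(self):
        self._heap = []

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)

buffer = RelevantUrlBuffer()
buffer.push("http://example.com/news", link_score(0.0, 0.1, 0.8, 0.0))
buffer.push("http://example.com/crawler", link_score(0.2, 0.5, 0.8, 0.1))
print(buffer.pop())  # the link with the larger total score comes out first
```

Both example links come from the same parent page, so they share the same γ component; the href, anchor text, and surrounding text decide which one is visited first.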

RESULTS
• To evaluate the performance of the algorithm, we use
precision to estimate the efficiency of a focused crawling
strategy. Precision is the ratio of on-topic pages to all downloaded
pages:
Precision rate = relevant pages / total downloaded pages
After applying the proposed steps to the seed URL and comparing the
results with other focused crawling algorithms, we find that this
method gives higher precision (approximately 60%).
• As the term frequency increases, the weight of the
keyword increases and so does the relevancy of the web page,
so the number of relevant web pages increases.
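
The precision rate defined above is a simple ratio. A minimal sketch, with a hypothetical function name and made-up counts used purely to show the arithmetic (they are not results from the article):

```python
def precision(relevant_pages, total_downloaded):
    """Fraction of downloaded pages that are relevant to the topic."""
    if total_downloaded == 0:
        return 0.0  # nothing downloaded yet, so precision is taken as zero
    return relevant_pages / total_downloaded

# Hypothetical crawl: 30 relevant pages out of 50 downloaded.
print(precision(30, 50))
```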

CONCLUSION AND FUTURE WORK
• We proposed a method for focused web crawling that allows
the crawler to reach relevant pages that would otherwise be
missed. With the steps explained in the proposed method,
we get better performance than the existing methods.
• A major open issue for future work is to do extensive testing with
a large volume of web pages.
• Future work also includes code optimization and URL queue
optimization.

REFERENCES
• Qu Cheng, Wang Beizhan, Wei Pianpian, “Efficient Focused Crawling Strategy Using
Combination of Link Structure and Content Similarity”, Software School, Xiamen
University, Xiamen 361005, Fujian, China, Proceedings of 2008 IEEE International
Symposium on IT in Medicine and Education, 978-1-4244-2511-2/08/$25.00 ©2008
IEEE.
• Meenu, Priyanka Singla, Rakesh Batra, “Design of a Focused Crawler Based on
Dynamic Computation of Topic Specific Weight Table”, International Journal of
Engineering Research and General Science, Volume 2, Issue 4, June-July 2014, ISSN
2091-2730.
• Anshika Pal, Deepak Singh Tomar, S.C. Shrivastava, “Effective Focused Crawling Based
on Content and Link Structure Analysis”, (IJCSIS) International Journal of Computer
Science and Information Security, Vol. 2, No. 1, June 2009.
• Bireshwar Ganguly, Rahila Sheikh, “A Review of Focused Web Crawling Strategies”,
International Journal of Advanced Computer Research, Volume 2, Number 4, Issue 6,
December 2012.
• Jaira Dubey, Divakar Singh, “A Survey on Web Crawler”, International Journal of
Electrical, Electronic and Computer System, ISSN (Online): 2347-2820, Volume 1,
Issue 1, 2013.
• Meenu, Rakesh Batra, “A Review of Focused Crawler Approaches”, International Journal
of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue
7, July 2014.
