Challenges in web crawling
WEB CRAWLER
A web crawler (also known as an ant, automatic indexer, bot,
web spider, or web robot) is an automated program or script
that methodically scans, or “crawls”, web pages to build an
index of the data it is set to look for. This process is
called web crawling or spidering.
CRAWLER
A crawler is a program that visits Web sites and reads their
pages and other information in order to create entries for
a search engine index. The major search engines on the
Web all have such a program, which is also known as a
"spider" or a "bot." Crawlers are typically programmed to
visit sites that have been submitted by their owners as new
or updated.
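As a rough illustration of "visiting sites and reading their pages", the loop below sketches a breadth-first crawl. It is a minimal sketch, not how any particular search engine works: the `TOY_WEB` dictionary, the `crawl` function, and the example URLs are all hypothetical stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP and parse links out of the HTML.
TOY_WEB = {
    "a.example/": ("home page", ["a.example/about", "b.example/"]),
    "a.example/about": ("about us", ["a.example/"]),
    "b.example/": ("another site", ["a.example/about", "c.example/"]),
    "c.example/": ("last page", []),
}


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at `seed`; returns URLs in visit order."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:       # unknown or unreachable URL
            continue
        _text, links = page
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl("a.example/", TOY_WEB.get)
print(order)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.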
HOW A WEB CRAWLER WORKS
The World Wide Web is full of information. If you want to know
something, you can probably find the information online. But
how can you find the answer you want, when the web contains
trillions of pages? How do you know where to look?
Fortunately, we have search engines to do the looking for us.
But how do search engines know where to look? How can
search engines recommend a few pages out of the trillions that
exist? The answer lies with web crawlers.
Crawlers scan web pages to see what words they contain,
and where those words are used. The crawler turns its
findings into a giant index. The index is basically a big list of
words and the web pages that feature them. So when you
ask a search engine for pages about hippos, the search
engine checks its index and gives you a list of pages that
mention hippos. Web crawlers scan the web regularly so
they always have an up-to-date index of the web.
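The "big list of words and the web pages that feature them" is commonly called an inverted index. The sketch below builds one for three made-up pages and answers the hippo query; the page URLs, the `tokenize`, `build_index`, and `search` helpers, and the simple "all query words must appear" rule are assumptions for illustration only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Hypothetical crawled pages.
PAGES = {
    "zoo.example/hippos": "Hippos live in rivers and lakes.",
    "zoo.example/lions": "Lions live on the savanna.",
    "blog.example/trip": "We saw hippos and lions on our trip.",
}

index = build_index(PAGES)
print(search(index, "hippos"))
print(search(index, "hippos lions"))
```

Looking a word up in the index is a dictionary access, which is why the search engine does not need to rescan every page for each query.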
THE SEO IMPLICATIONS OF WEB CRAWLERS
Now that you know how a web crawler works, you can see
that the behavior of the web crawler has implications for
how you optimize your website.
For example, you can see that, if you sell parachutes, it’s
important that you write about parachutes on your website.
If you don’t write about parachutes, search engines will
never suggest your website to people searching for
parachutes.
It’s also important to note that web crawlers don’t just pay attention to
which words they find; they also record where the words are found. So
the web crawler knows that a word contained in headings, metadata,
or the first few sentences is likely to be more important in the context
of the page, and that keywords in prime locations suggest that the page
is really ‘about’ those keywords.
So if you want search engines to know that parachutes are a big deal on
your website, mention them in your headings, metadata, and opening
sentences.
The fact that web crawlers regularly trawl the web to make sure their
index is up to date also suggests that having fresh content on your
website is a good thing too.
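One way to picture "where the words are found" is to give each location a weight and add the weights up per word. The sketch below does this with Python's standard `html.parser`. The weights in `WEIGHTS`, the `WeightedTerms` class, and the sample page are invented for illustration and do not reflect how any real search engine scores pages; the "first few sentences" signal is left out to keep the example short.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed, illustrative weights: prominent locations count for more.
WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "meta": 3, "body": 1}


class WeightedTerms(HTMLParser):
    """Collect word scores, weighting each word by where it appears."""

    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self._stack = []  # currently open title/h1/h2 tags

    def _add(self, text, location):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.scores[word] += WEIGHTS[location]

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") in ("description", "keywords"):
                self._add(attrs.get("content") or "", "meta")
        elif tag in ("title", "h1", "h2"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        location = self._stack[-1] if self._stack else "body"
        self._add(data, location)


# Hypothetical page for a parachute shop.
PAGE = """<html><head><title>Parachutes for sale</title>
<meta name="description" content="Sport parachutes and rigs">
</head><body><h1>Parachutes</h1>
<p>We also stock helmets. Every parachute is inspected.</p></body></html>"""

parser = WeightedTerms()
parser.feed(PAGE)
parser.close()
print(parser.scores.most_common(3))
```

Here "parachutes" appears in the title, the meta description, and the main heading, so it outscores "helmets", which appears only in body text.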
SEARCH ENGINE INDEXES
Once the crawler has found information by crawling over the web, the
program builds the index. The index is essentially a big list of all the
words the crawler has found, as well as their location.
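An index that stores locations as well as words is often called a positional index. The following is a minimal sketch under the assumption that "location" means the page plus the word's position within that page; `build_positional_index` and the two sample pages are hypothetical.

```python
import re
from collections import defaultdict


def build_positional_index(pages):
    """Map word -> {url -> [positions of the word within that page]}."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word][url].append(position)
    return index


# Hypothetical crawled pages.
SAMPLE_PAGES = {
    "p1": "hippos eat grass; hippos swim",
    "p2": "grass grows",
}

positional = build_positional_index(SAMPLE_PAGES)
print(dict(positional["hippos"]))
print(dict(positional["grass"]))
```

Keeping positions makes it possible to tell later whether a word occurred near the start of a page or next to another query word.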
CHALLENGES IN WEB CRAWLING
• Challenge I: Non-Uniform Structures
• Challenge II: Omnipresence of AJAX elements
• Challenge III: The “Real” Real-Time Latency
• Challenge IV: Who owns UGC?
CHALLENGE I: NON-UNIFORM STRUCTURES
Data formats and structures are inconsistent across the ever-evolving Web,
and there are no binding norms on how to build an Internet presence.
The result?
A lack of uniformity across the vast, ever-changing terrain of the Internet.
The problem?
Collecting data in a machine-readable format becomes difficult, and the
difficulty grows with scale, especially when:
a) structured data is needed, and
b) a large number of details must be extracted against a specific schema from
multiple sources.
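To make the schema problem concrete, the sketch below maps records from two differently structured sources onto one target schema. Everything in it is hypothetical: the `Product` schema, the `shop_a` and `shop_b` sources, their field names, and the price formats. In practice each new source tends to need its own mapping, which is where the effort at scale comes from.

```python
from dataclasses import dataclass


@dataclass
class Product:
    """Target schema that every source is normalized into (assumed)."""
    name: str
    price_usd: float


# Hypothetical per-source mappings: target field -> field name in that source.
FIELD_MAPS = {
    "shop_a": {"name": "title", "price_usd": "price"},
    "shop_b": {"name": "productName", "price_usd": "cost_usd"},
}


def parse_price(raw):
    """Accept a number or a string such as '12.50' or '$1,299.00'."""
    if isinstance(raw, (int, float)):
        return float(raw)
    return float(str(raw).replace("$", "").replace(",", "").strip())


def normalize(source, record):
    """Convert one raw record from `source` into the shared Product schema."""
    mapping = FIELD_MAPS[source]
    return Product(
        name=str(record[mapping["name"]]).strip(),
        price_usd=parse_price(record[mapping["price_usd"]]),
    )


a = normalize("shop_a", {"title": " Main canopy ", "price": "$1,299.00"})
b = normalize("shop_b", {"productName": "Helmet", "cost_usd": 89})
print(a, b)
```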
CHALLENGE II: OMNIPRESENCE OF AJAX ELEMENTS
AJAX and interactive web components make websites more user-friendly, but
not for crawlers!
The result?
Content is produced dynamically (on the fly) by the browser and is
therefore not visible to crawlers that read only the raw HTML.
The problem?
To keep the content up to date, the crawler needs to be maintained manually
on a regular basis. Even Google’s crawlers can find it difficult to
extract such information!
The solution?
Crawlers need a more refined approach in order to be efficient and
scalable.
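The sketch below illustrates why dynamically produced content is missed by a crawler that reads only the raw HTML. Two hypothetical versions of a reviews page are parsed with the standard `html.parser`: one with the review in the markup, one where a script would load it in the browser. The `VisibleText` class, the page snippets, and the `/api/reviews` endpoint are made up for this example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script> and <style>, as a simple crawler might."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Review text is present in the HTML sent by the server.
STATIC_PAGE = "<body><h1>Reviews</h1><ul id='reviews'><li>Great rig</li></ul></body>"

# Review text would be fetched and inserted by JavaScript in the browser.
DYNAMIC_PAGE = """<body><h1>Reviews</h1><ul id='reviews'></ul>
<script>
fetch('/api/reviews').then(r => r.json()).then(render);
</script></body>"""

print(visible_text(STATIC_PAGE))
print(visible_text(DYNAMIC_PAGE))
```

The parser never runs the script, so the second page yields only its heading. A crawler would need to execute the page's JavaScript, or request the underlying data source itself, to see the reviews.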
CHALLENGE III: THE “REAL” REAL-TIME LATENCY
Acquiring datasets in real time is a huge problem! Real-time data is
critical in security and intelligence to predict, report, and enable
preemptive actions against untoward incidents.
The problem?
The real problem comes in deciding what is and isn't important in real
time.
CHALLENGE IV: WHO OWNS UGC?
Ownership of User-Generated Content (UGC) is claimed by giants
like Craigslist and Yelp, and such content is usually out of bounds
for commercial crawlers.
The result?
Only 2-3% of sites disallow bots. Others believe in data democratization,
but it is possible that they may follow suit and shut off access to the
data gold mine!
The problem?
Sites police against web scraping and reject bots.
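One common way a site tells bots what they may fetch is a robots.txt file, and a well-behaved crawler checks it before requesting a page. The sketch below uses Python's standard `urllib.robotparser` on an in-memory robots.txt; the rules, the `MyCrawler` and `BadBot` user agents, and the `site.example` URLs are invented for illustration. Note that robots.txt is only one signal: it does not settle who owns the content or what a site's terms allow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and one named bot is rejected entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def make_checker(robots_txt):
    """Build a parser from robots.txt text instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = make_checker(ROBOTS_TXT)

allowed_public = rules.can_fetch("MyCrawler", "https://site.example/listings")
allowed_private = rules.can_fetch("MyCrawler", "https://site.example/private/x")
allowed_badbot = rules.can_fetch("BadBot", "https://site.example/listings")
print(allowed_public, allowed_private, allowed_badbot)
```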
THANK YOU!
