Web Scraping

Dr. Carlos Rodríguez Contreras
UNAM

Web Scraping
• Web scraping, web harvesting, or web data extraction is data
scraping used for extracting data from websites.
• Web scraping software may access the World Wide Web directly
using the Hypertext Transfer Protocol, or through a web browser.
• While web scraping can be done manually by a software user, the
term typically refers to automated processes implemented using
a bot or web crawler.

• Web scraping is a form of copying, in which specific data is gathered
and copied from the web, typically into a central local database or
spreadsheet, for later retrieval or analysis.
• Scraping a web page involves fetching it and extracting data from it.
• Fetching is the downloading of a page (which a browser does when
you view the page).
• Web crawling is therefore a main component of web scraping: it
fetches the pages for later processing.
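
The two steps above, fetching and extracting, can be sketched in Python using only the standard library. This is a minimal illustration, not a production scraper: the function names, the choice of links as the data to extract, and the sample HTML are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url: str) -> str:
    """Fetching: download the page, as a browser does when you view it."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Extraction: collect the href value of every <a> tag in the page."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list:
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content, standing in for the result of fetch(url).
SAMPLE_HTML = '<p><a href="/a.html">A</a> and <a href="https://example.com/b">B</a></p>'
print(extract_links(SAMPLE_HTML))
```

A crawler would repeat this loop: fetch a page, extract its links, and queue those links to be fetched in turn.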

• Once a page has been fetched, extraction can take place. The content
of the page may be parsed, searched, or reformatted, its data copied
into a spreadsheet, and so on.
• Web scrapers typically take something out of a page, to make use
of it for another purpose somewhere else.
• An example would be to find and copy names and phone numbers,
or companies and their URLs, to a list (contact scraping).
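
Contact scraping as described above might look like the following Python sketch, which searches already-fetched page text for name and phone pairs and copies them into CSV, a format any spreadsheet can open. The page text, the names, the phone format, and the function names are all made up for illustration; real pages usually need a more forgiving pattern.

```python
import csv
import io
import re

# Hypothetical page text; the names and numbers are invented.
PAGE_TEXT = """
Ana Torres: 555-0101
Luis Vega: 555-0199
Office hours: 9 to 5
"""

# Assumed layout: two capitalised words, a colon, then an NNN-NNNN number.
CONTACT = re.compile(r"^([A-Z][a-z]+ [A-Z][a-z]+): (\d{3}-\d{4})$", re.MULTILINE)


def scrape_contacts(text):
    """Search the text and return a list of (name, phone) tuples."""
    return CONTACT.findall(text)


def to_csv(rows):
    """Reformat the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "phone"])
    writer.writerows(rows)
    return buffer.getvalue()


contacts = scrape_contacts(PAGE_TEXT)
print(to_csv(contacts))
```

The "Office hours" line is skipped because it does not match the assumed pattern, which is the point of extraction: only the specific data of interest is copied out.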

• Web scraping is used as a component of applications for web
indexing, web mining and data mining, online price change
monitoring and price comparison, product review scraping (to watch
the competition), gathering real estate listings, weather data
monitoring, website change detection, research, tracking online
presence and reputation, web mashups, and web data integration.
• Newer forms of web scraping involve listening to data feeds from
web servers. For example, JSON is commonly used as a data
transport format between the client and the web server.
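
When a site exposes its data as a JSON feed, extraction reduces to decoding the response body, with no HTML parsing needed. The Python sketch below assumes a hypothetical real estate feed; the payload, its field names, and the function name are invented for the example and do not describe any real service.

```python
import json

# Hypothetical response body from a listings feed; field names are assumptions.
PAYLOAD = (
    '{"listings": ['
    '{"city": "Puebla", "price": 1200000}, '
    '{"city": "Oaxaca", "price": 950000}]}'
)


def parse_listings(body):
    """Decode the JSON text and return (city, price) rows."""
    data = json.loads(body)
    return [(item["city"], item["price"]) for item in data["listings"]]


for city, price in parse_listings(PAYLOAD):
    print(city, price)
```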

• Each year Forbes ranks the world based on a variety of categories,
ranging from the wealthiest people on the planet to the best
colleges America has to offer.
• Forbes is the preeminent maintainer of lists covering a wide range of
business-related topics, including sports, entertainment, individual
wealth, and locations.
• The lists are chock-full of data that can be analyzed, visualized,
and merged with other data.

• Upon discovering that Forbes went to the pains of building an API,
even though it is undocumented, Alex Bresler decided to build the
forbesListR package to wrap that API and make it as easy as
possible to access the data with a few simple functions.

Forbes Global 2000
Scraping with R
