Scraping to the rescue
(Web scraping using Python)
By: Satwik Kansal and Pradhvan Bisht
What is web scraping?
Web scraping is a technique for extracting large amounts of
data from websites and saving it to a local file on your computer.
The extracted data can then be displayed on your own website or
application, used for data analysis, or put to any other purpose.
Getting started with Web Scraping in Python
Why should you scrape?
- API may not provide what you need
- No API-imposed rate limit
- Take what you really want!
- Reduces manual effort
- Swag!
Things that might come in handy
- HTML
- CSS
- XPath
- Regular Expressions
How it’s done?
Broadly, a three-step process:
1. Getting the content (in most cases HTML).
2. Parsing the response.
3. Optimizing performance and preserving the data.
GETTING THE CONTENT
● Use modules such as urllib, urllib2 (Python 2 only), requests, mechanize and selenium.
● Involves a GET or POST request to the server.
● The response contains the information to be extracted.
● Sometimes not as easy as it may seem.
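The fetch step can be sketched roughly as follows, assuming Python 3 and only the standard library's urllib.request (the requests library offers a friendlier API for the same job). The function name fetch_html, the User-Agent string and the default timeout are illustrative choices, not something taken from the slides.

```python
import urllib.request


def fetch_html(url, timeout=10.0, user_agent="example-scraper/0.1"):
    """Send a GET request and return the response body as text."""
    # Some sites reject requests without a User-Agent header.
    # The value used here is a placeholder (assumption).
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Prefer the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html with a page URL returns the HTML as a string, ready for the parsing step. Network errors surface as urllib.error.URLError, which a real scraper would probably want to catch and retry.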
Extracting The Data
1. Using regular expressions and basic Python
Tricky, complex and kind of fragile.
2. Using parsing libraries
❏ Two different approaches possible -- simple parsing and search-tree
parsing.
❏ Some popular libraries are BeautifulSoup, lxml, and html5lib.
❏ Each module has its own techniques and thus its own pros and trade-offs.
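To illustrate why the regular-expression route is called fragile, here is a small side-by-side sketch. It uses the standard library's html.parser as a stand-in for BeautifulSoup, lxml or html5lib so that it runs without extra installs; those libraries expose richer, different APIs. The names LINK_PATTERN, LinkCollector, links_with_regex and links_with_parser are made up for this example.

```python
import re
from html.parser import HTMLParser

# Approach 1: a regular expression that expects a quoted href attribute.
LINK_PATTERN = re.compile(r"""<a\s[^>]*?href=["']([^"']+)["']""", re.IGNORECASE)


def links_with_regex(html):
    """Return href values matched by the pattern, in document order."""
    return LINK_PATTERN.findall(html)


# Approach 2: a real HTML parser that understands the tag syntax.
class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def links_with_parser(html):
    """Return href values found by the parser, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# The second link uses an unquoted attribute, which is valid HTML.
sample = '<a href="/about">About</a> <a href=/jobs>Jobs</a>'
print(links_with_regex(sample))   # ['/about']
print(links_with_parser(sample))  # ['/about', '/jobs']
```

The pattern could be extended to cover unquoted attributes, but every such extension adds another case to maintain, which is the trade-off the slide points at.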
Comparing Parsers
BEAUTIFUL SOUP
LXML
SCRAPY
HTML5LIB
Preserving The Data
1. Writing to a file.
2. Exporting as a CSV or Excel file.
3. Storing in a database.
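Two of these options can be sketched with the standard library alone: csv for export and sqlite3 for database storage. The row layout (title and url keys), the table name posts and both function names are assumptions made for the example.

```python
import csv
import sqlite3


def save_csv(rows, path):
    """Write a non-empty list of dicts to a CSV file with a header row."""
    fieldnames = list(rows[0].keys())
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def save_sqlite(rows, connection):
    """Insert the rows into a SQLite table, creating it on first use."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS posts (title TEXT, url TEXT UNIQUE)"
    )
    # INSERT OR IGNORE skips rows whose url is already stored,
    # so running the scraper twice does not duplicate data.
    connection.executemany(
        "INSERT OR IGNORE INTO posts (title, url) VALUES (:title, :url)", rows
    )
    connection.commit()
```

A caller would pass something like sqlite3.connect("scraped.db") as the connection. The UNIQUE constraint on url is a design choice for repeatable runs, not a requirement.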
Examples
Example 1: Scraping tweets from Twitter using BeautifulSoup
and Python’s Requests module
Code
Example 2: Scraping top Stack Overflow posts using Scrapy
Code
Example 3: Using Selenium to log in and fetch library
details from a university library site that uses dynamic
HTML.
WHAT TO USE WHERE
1. Handling dynamically generated HTML
Solutions: Selenium or SpiderMonkey
2. Cookie-based authentication
Solution: Requests module.
3. Simple scraping
Solutions: BeautifulSoup + Requests, Scrapy, Selenium
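For the cookie-based case, the idea is to log in once and let the client resend the session cookie on later requests. The Requests module does this through requests.Session. The sketch below shows the same mechanism with the standard library's http.cookiejar, so it is a stand-in rather than the approach named on the slide. The form field names username and password are assumptions; the real names come from the site's login form.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """URL-encode form fields as the bytes body of a POST request."""
    return urllib.parse.urlencode(fields).encode("ascii")


def login(opener, login_url, username, password, timeout=10.0):
    """POST the credentials; any cookie the server sets lands in the jar.

    The field names "username" and "password" are hypothetical.
    """
    body = encode_form({"username": username, "password": password})
    # Passing data makes urllib send a POST instead of a GET.
    with opener.open(login_url, data=body, timeout=timeout) as response:
        return response.status
```

After a successful login, further opener.open calls on the same opener carry the stored cookie automatically, which is what lets the scraper reach pages behind the login.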
Scraping hacks
1. Overcoming CAPTCHAs
Lookup tables, one-time manual entry, Death By Captcha (paid service)
2. Per-IP-address query limit
Using tsocks, ssh -D and socks monkey.
3. Improving performance
Multiprocessing, gevent and requests.async (a module in early versions of Requests, later split out as the separate grequests package).
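As one way to picture the performance point, the sketch below fetches many URLs concurrently. It uses a thread pool from concurrent.futures instead of multiprocessing or gevent, on the assumption that scraping is mostly I/O-bound; the other tools follow the same map-over-URLs shape. The function name fetch_all and its fetch parameter (any callable that takes a URL) are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Apply fetch to every URL concurrently; results keep the input order.

    A failed fetch is returned as the exception object instead of
    aborting the whole batch.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as error:  # the caller decides how to handle failures
            return error

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the input iterable.
        return list(pool.map(safe_fetch, urls))
```

Keeping max_workers small is a deliberate choice: a large pool speeds things up but also makes it easier to hit the per-IP query limits mentioned above.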
Example 3
Automating My College Library
Problems:
1. Authentication
2. Dynamically generated <iframe> tag
Solution
Selenium with a headless browser such as PhantomJS
Alternative: Mechanize
Code
Ethics Of Scraping
Exceeding authorized use of the site
This means doing anything that is prohibited in the Terms of Use
(see CFAA, breach of contract, unjust enrichment, trespass
to chattels, and various state laws similar to CFAA).
Copyright issues
If the material you are scraping is not factual, but
something that required some amount of creativity to create,
you have copyright to worry about.
Quick tip -- Conform to the robots.txt file.
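Checking robots.txt can be automated with the standard library's urllib.robotparser. The sketch below parses an in-memory robots.txt so it runs offline; the rules, the user-agent name example-scraper and the example.com URLs are all made up for illustration.

```python
import urllib.robotparser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a reusable rule checker."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(ROBOTS_TXT)
print(rules.can_fetch("example-scraper", "https://example.com/private/report"))  # False
print(rules.can_fetch("example-scraper", "https://example.com/catalog"))         # True
```

Against a live site, the same RobotFileParser object can load the file itself through set_url followed by read, and the scraper would call can_fetch before each request.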
● The brute-force way to get the information required.
● Absolutely Legal
● Not always that easy.
