Scraping Data from the Web using
Scrapy & Beautiful Soup
Nithish Raghunandanan
nithishr@gmail.com
PyData Munich | 8th November 2017
About Me
● MSc. Informatics Student at the Technical University of Munich
○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
■ nithishr1
@nithishr
What is Scraping?
● Extract data from web pages
● Store the data in structured formats
● Useful when data is not available directly or via APIs
Use Cases
Tools for Scraping
● Scrapy
○ Python framework to extract data from web pages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
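To make the "library to parse HTML/XML documents" point concrete, here is a minimal Beautiful Soup sketch. The markup, the class names (entry, city) and the parse_listings helper are assumptions made up for illustration; a real page will need its own selectors. It uses Python's built-in html.parser backend so no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; real pages will be structured differently.
HTML = """
<ul id="listings">
  <li class="entry"><a href="/biz/1">Alpha Bakery</a> <span class="city">Munich</span></li>
  <li class="entry"><a href="/biz/2">Beta Cafe</a> <span class="city">Berlin</span></li>
</ul>
"""


def parse_listings(html):
    """Return one dict per <li class="entry"> element found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for entry in soup.find_all("li", class_="entry"):
        link = entry.find("a")
        city = entry.find("span", class_="city")
        rows.append({
            "name": link.get_text(strip=True),
            "url": link["href"],
            "city": city.get_text(strip=True),
        })
    return rows


listings = parse_listings(HTML)
for row in listings:
    print(row)
```

In a Scrapy project the same parsing function could be called on the body of each downloaded response, which is one way the two tools can be combined.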
Tutorial on Web Scraping in Python
Scraping 101
● Spider
○ A bot that downloads web pages
● robots.txt
○ File on the server that specifies access limits for bots
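A spider can consult robots.txt before it requests a page. The sketch below uses urllib.robotparser from the Python standard library and parses an inline, made-up robots.txt so that it runs without network access. The rules, the example.com URLs and the "example-bot" user agent are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="example-bot"):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/listings"))
print(is_allowed("https://example.com/private/data.html"))
print(parser.crawl_delay("example-bot"))
```

Against a live site, set_url() followed by read() would load the real file instead of the inline string.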
Pitfalls in Crawling
● JavaScript-heavy websites
○ Splash plugin
○ Selenium
● Default settings are not very friendly to website owners
○ Built-in AutoThrottle extension
● CAPTCHAs
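As a rough illustration of friendlier settings, the fragment below shows what a Scrapy project's settings.py might contain. The setting names are Scrapy's own, but every value here (delays, concurrency, the user agent string) is an assumption picked for the example, not a recommendation from the talk. AutoThrottle stays off unless it is enabled explicitly.

```python
# settings.py of a Scrapy project (all values are illustrative assumptions)

# Identify the bot so site owners can get in touch (hypothetical contact URL).
USER_AGENT = "example-bot (+https://example.com/contact)"

# Respect the access limits published in robots.txt.
ROBOTSTXT_OBEY = True

# Let the AutoThrottle extension adapt the delay to server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server

# Hard limits that apply in addition to AutoThrottle.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2
```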
Why Yellow Pages?
Email Marketing for Customer Acquisition
Email Marketing for Customer Acquisition
Initial Approach
● Buy Email Lists
● Send via 3rd Parties
● Poor Quality
○ Non-transparent
○ Generic emails
● Expensive
Crawling
● Scrapy + Beautiful Soup
● Over 500k Emails
● Quality Improvement
○ Categorized into segments
○ Targeted emails
● Cheap
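The "categorized into segments" step could look roughly like the sketch below. It is not the code used in the talk: the record layout, the category names and the segment_by_category helper are all hypothetical, and the sample addresses use the reserved .example domain. It only shows the general idea of deduplicating scraped emails and grouping them by business category so that each segment can receive a targeted email.

```python
from collections import defaultdict

# Hypothetical records, shaped like items a spider might yield.
RECORDS = [
    {"name": "Alpha Bakery", "category": "Bakery", "email": "info@alpha.example"},
    {"name": "Beta Cafe", "category": "Cafe", "email": "hello@beta.example"},
    {"name": "Alpha Bakery (dup)", "category": "Bakery", "email": "INFO@alpha.example"},
    {"name": "Gamma Bakery", "category": "Bakery", "email": "contact@gamma.example"},
    {"name": "No Email Ltd", "category": "Cafe", "email": ""},
]


def segment_by_category(records):
    """Group unique, non-empty emails by category (case-insensitive dedupe)."""
    seen = set()
    segments = defaultdict(list)
    for record in records:
        email = record["email"].strip().lower()
        if not email or email in seen:
            continue  # skip missing addresses and repeats
        seen.add(email)
        segments[record["category"]].append(email)
    return dict(segments)


segments = segment_by_category(RECORDS)
for category, emails in segments.items():
    print(category, emails)
```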
nithishr1
@nithishr
nithishr@gmail.com
Connect
Nithish Raghunandanan
www.ki-labs.com
Resources
● Scrapy Guide
○ https://doc.scrapy.org/en/latest/intro/tutorial.html
● Beautiful Soup Guide
○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Crawling Etiquette
○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
● Code
○ https://github.com/nithishr/meetup_scraping
