Scraping recalcitrant web sites with Python & Selenium
Roger Barnes

SyPy July 2012
Some sites suck
Some sites suck - "for your own good"




For security reasons, each button is an image, dynamically generated by
a hash wrapped in a mess of JavaScript, randomly placed
...but they work in a web browser!




Let's use the web browser to scrape them
Enter Selenium



Selenium automates browsers

That's it
Selenium can...
● navigate (windows, frames, links)
● find elements and parse attributes
● interact and trigger events (click, type, ...)
● capture screenshots
● run JavaScript
● let the browser take care of the hard stuff (cookies, JavaScript, sessions, profiles, DOM)

Comes with various components and bindings ... including Python
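As a rough illustration of the capabilities listed above, the sketch below walks a driver through each one using the Selenium 4 Python API. The function name `tour`, the URL, the frame name and the element id are all placeholders invented for this example, not taken from the talk. The function only assumes it is handed an object that behaves like a `webdriver` instance.

```python
def tour(driver, url, frame_name, field_id, text):
    """Exercise the basic Selenium capabilities against one page.

    `driver` is expected to behave like a Selenium 4 WebDriver, for
    example `selenium.webdriver.Firefox()`. Every other argument is a
    placeholder chosen by the caller.
    """
    # Navigate: load a page, then step into a named frame.
    driver.get(url)
    driver.switch_to.frame(frame_name)

    # Find an element and read one of its attributes.
    # "id" is the string value of selenium.webdriver.common.by.By.ID.
    field = driver.find_element("id", field_id)
    field_type = field.get_attribute("type")

    # Interact: clear the field and type into it.
    field.clear()
    field.send_keys(text)

    # Run JavaScript in the page and get the result back in Python.
    title = driver.execute_script("return document.title;")

    # Capture a screenshot of the browser window as PNG bytes.
    png = driver.get_screenshot_as_png()

    return {"field_type": field_type, "title": title, "screenshot": png}
```

With a real browser this would be called with something like `tour(webdriver.Firefox(), ...)` after `from selenium import webdriver`. Passing the driver in keeps the browser choice (Firefox, Chrome, or one running under xvfb) out of the scraping logic.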
General Recipe
Ingredients:
● Firefox (or Chrome)
● Firebug (or Chrome dev tools)
● Selenium IDE
    ○ record a session, write less code
● Python and its batteries
● python-selenium
● xvfb and pyvirtualdisplay (optional)
● other libraries to taste
    ○ e.g. image manipulation, database access, DOM parsing, OCR
General Recipe
Method:
● Install requirements (apt-get, pip, etc.)
    ○ sudo apt-get install xvfb firefox
    ○ pip install selenium pyvirtualdisplay
● Start up Firefox and Selenium IDE
● Record a "test" run through the site
    ○ Add in some assertions along the way
● Export the test as a Python script
● Hack from there
    ○ Loops
    ○ Image/data extraction
    ○ Wrangling data into a database
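The last step, wrangling scraped data into a database, might look like the following minimal sketch using the standard-library `sqlite3` module. The table name, the columns and both function names are invented for illustration. The talk does not say which database or schema was used.

```python
import sqlite3


def save_balances(conn, rows):
    """Store (account, day, amount) tuples, one row per account per day."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS balance (
            account TEXT NOT NULL,
            day     TEXT NOT NULL,   -- ISO date string, sorts chronologically
            amount  REAL NOT NULL,
            PRIMARY KEY (account, day)
        )
        """
    )
    # INSERT OR REPLACE lets a second scrape on the same day overwrite
    # the earlier row instead of failing on the primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO balance (account, day, amount) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def latest_balances(conn):
    """Return {account: amount} for the most recent day stored per account."""
    cursor = conn.execute(
        """
        SELECT b.account, b.amount
        FROM balance AS b
        JOIN (SELECT account, MAX(day) AS day FROM balance GROUP BY account) AS m
          ON m.account = b.account AND m.day = b.day
        """
    )
    return dict(cursor.fetchall())
```

Keying on (account, day) means the scraper can be re-run freely, for example from cron, without piling up duplicate rows.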
Example from Selenium IDE
# Selenium 2-era Python, as exported by Selenium IDE
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By


class Ingdirect2(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)  # wait up to 30 s for elements to appear
        self.base_url = "https://www.ingdirect.com.au"
        self.verificationErrors = []

    def test_ingdirect2(self):
        driver = self.driver
        driver.get(self.base_url + "/client/index.aspx")
        driver.switch_to_frame('body')  # Had to add this
        # Enter the client number (dummy value)
        driver.find_element_by_id("txtCIF").clear()
        driver.find_element_by_id("txtCIF").send_keys("12345678")
        # Click four keypad buttons by their fixed ids
        driver.find_element_by_id("objKeypad_B1").click()
        driver.find_element_by_id("objKeypad_B2").click()
        driver.find_element_by_id("objKeypad_B3").click()
        driver.find_element_by_id("objKeypad_B4").click()
        driver.find_element_by_id("btnLogin").click()
        # is_element_present is a helper from the exported script (not shown on the slide)
        self.assertTrue(self.is_element_present(By.ID, "ctl2_lblBalance"))

But what about that dang keypad? ...
PIL saves the day
# Python 2; getcropbox() and hex_mapping are defined elsewhere (not shown on the slide)
import base64
import hashlib
import StringIO

from PIL import Image

# Get screenshot for extraction of button images
screenshot = driver.get_screenshot_as_base64()
im = Image.open(StringIO.StringIO(base64.decodestring(screenshot)))

table = driver.find_element_by_xpath('//*[@id="objKeypad_divShowAll"]/table')
all_buttons = table.find_elements_by_tag_name("input")

# Determine md5sum of each button by cropping based on element positions
button_mapping = {}
for button in all_buttons:
    button_image = im.crop(getcropbox(button))
    hexid = hashlib.md5(button_image.tostring()).hexdigest()
    button_mapping[hexid] = button.get_attribute("id")

# Now we know which button is which (based on previous lookup), enter the PIN
for char in self.pin:
    driver.find_element_by_id(button_mapping[hex_mapping[char]]).click()

driver.find_element_by_id("btnLogin").click()

# We're in!!!11one
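The slide uses `getcropbox`, `button_mapping` and `hex_mapping` without showing how they are built. The sketch below is one guess at the pure-Python parts, kept free of Selenium and PIL so that it can be run on its own. Note that the slide's `getcropbox` takes the element itself. The version here instead takes the element's `location` and `size` dictionaries, and `build_button_mapping` and `pin_to_button_ids` are hypothetical helper names.

```python
import hashlib


def getcropbox(location, size):
    """Turn an element's position and size into a PIL-style crop box.

    `location` looks like Selenium's `element.location` ({"x": ..., "y": ...})
    and `size` like `element.size` ({"width": ..., "height": ...}).
    Returns (left, upper, right, lower), the tuple `Image.crop` expects.
    """
    left, upper = int(location["x"]), int(location["y"])
    return (left, upper, left + int(size["width"]), upper + int(size["height"]))


def build_button_mapping(button_pixels):
    """Map md5(pixel bytes) -> button id for the buttons on screen right now.

    `button_pixels` maps each button id to the raw bytes of its cropped image.
    """
    return {
        hashlib.md5(pixels).hexdigest(): button_id
        for button_id, pixels in button_pixels.items()
    }


def pin_to_button_ids(pin, hex_mapping, button_mapping):
    """Translate each PIN character into the id of the button showing it.

    `hex_mapping` is the previously recorded lookup: character -> md5 of
    the image that displays that character.
    """
    return [button_mapping[hex_mapping[char]] for char in pin]
```

The approach relies on each digit's image being byte-for-byte identical between sessions, so that only the position of the buttons changes. If the site rendered the images with any per-session noise, an exact md5 match would no longer work and something fuzzier, such as OCR, would be needed.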
But why do all this?
It's my data! ... and I'll graph if I want to

* Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
That's all folks
Slides
● http://bit.ly/scrapium

Code
● https://gist.github.com/3015852

Me
● https://twitter.com/mindsocket
● https://github.com/mindsocket
● roger@mindsocket.com.au
