Kyle Banerjee
banerjek@ohsu.edu
Web Scraping Basics
The truth of the matter is...
Web scraping is one of the
worst ways to get data!
What’s wrong with scraping?
1. Slow, resource-intensive, not scalable
2. Unreliable -- breaks when website
changes and works poorly with
responsive design techniques
3. Difficult to parse data
4. Harvest looks like an attack
5. Often prohibited by TOS
Before writing a scraper
Call!
● Explore better options
● Check terms of service
● Ask permission
● Can you afford scrape
errors?
Alternatives to scraping
1. Data dumps
2. API
3. Direct database connections
4. Shipping drives
5. Shared infrastructure
Many datasets are easy to retrieve
You can often export search results
Why scrape the Web?
1. Might be the only method available
2. Sometimes can get precombined or
preprocessed info that would otherwise
be hard to generate
Things to know
1. Web scraping is about parsing and
cleaning.
2. You don’t need to be a programmer, but
scripting experience is very helpful.
Don’t use Excel. Seriously.
Excel
● Mangles your data
○ Identifiers and numeric data at risk
● Cannot handle carriage returns in data
● Crashes with large files
● OpenRefine is a better tool for situations
where you think you need Excel
http://openrefine.org
Harvesting options
● Free utilities
● Purchased software
● DaaS (Data as a Service) -- hosted web
spidering
● Write your own
Watch out for spider traps!
● Web pages that intentionally or
unintentionally cause a crawler to make
an infinite number of requests
● No algorithm can detect all spider traps
Ask for help!
1. Methods described here are familiar to
almost all systems people
2. Domain experts can help you identify tools
and shortcuts that are especially relevant
to you
3. Bouncing ideas off *anyone* usually results
in a superior outcome
Handy skills
Skill Benefit
DOM Identify and extract data
Regular expressions Identify and extract data
Command line Process large files
Scripting Automate repetitive tasks; perform complex operations
Handy basic tools
Tool Benefit
Web scraping service Simplify data acquisition
cURL (command line) Easily retrieve data using APIs
wget (command line) Recursively retrieve web pages
OpenRefine Process and clean data
Power tools
Tool Benefit
grep, sed, awk, tr, paste Select and transform data in VERY large files quickly
jq Easily manipulate JSON
xml2json Convert XML to JSON
csvkit Utilities to convert to and work with CSV
scrape HTML extraction using XPath and CSS selectors
Web scraping, the easy way
● Hosted services allow you to easily target
specific structures and pages
● Programming experience unnecessary, but
helpful
● For unfamiliar problems, ask for help
Hosted example, Scrapinghub
Scrapinghub data output
Document Object Model (DOM)
● Programming interface for HTML and XML
documents
● Supported by many languages/environments
● Represents documents in a tree structure
● Used to directly access content
Document Object Model (DOM) Tree
/document/html/body/div/p = “text node”
XPath is a syntax for defining
parts of an XML document
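As a quick illustration, XPath can be run straight from the command line with xmllint (part of libxml2). This is a minimal sketch; page.html is a hypothetical saved page, and the path mirrors the tree above:

xmllint --html --xpath '/html/body/div/p/text()' page.html

This prints the text node directly, no programming required.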
The Swiss Army Knife of data
Regular Expressions
● Special strings that allow you to search
and replace based on patterns
● Supported in a wide variety of software
and all operating systems
Regular expressions can...
● Use logic, capitalization, edges of
words/lines, express ranges, use bits (or
all) of what you matched in replacements
● Convert free text into XML, delimited
text, or codes, and vice versa
● Find complex patterns using proximity
indicators and/or involving multiple lines
● Select preferred versions of fields
Quick Regular Expression Guide
^ Match the start of the line
$ Match the end of the line
. Match any single character
* Match zero or more of the previous character
[A-DG-J0-5]* Match zero or more of ABCDGHIJ012345 (a comma inside brackets would be matched literally, so leave it out)
[^A-C] Match any one character that is NOT A,B, or C
(dog)
Match the word "dog", including case, and remember that text
to be used later in the match or replacement
\1
Insert the first remembered text as if it were typed here (\2 for
second, \3 for 3rd, etc.)
\
Use to match special characters. \\ matches a backslash, \*
matches an asterisk, etc.
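To see a couple of these in action, here is a minimal sketch using sed and grep (codes.txt is a hypothetical file):

echo "dog dog" | sed -E 's/(dog) \1/\1/'    # backreference collapses the repeat, printing "dog"
grep -E '^[A-DG-J0-5]+$' codes.txt          # keeps lines made only of ABCDGHIJ012345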
Data can contain weird problems
● XML metadata contained errors on every
field that contained an HTML entity (&amp;
&lt; &gt; &quot; &apos; etc.)
<b>Oregon Health &amp</b>
<b> Science University</b>
● Error occurs in many fields scattered across
thousands of records
● But this can be fixed in seconds!
Regular expressions to the rescue!
● “Whenever a field ends in an HTML entity
minus the semicolon and is followed by an
identical field, join those into a single field and
fix the entity. Any line can begin with an
unknown number of tabs or spaces”
/^\s*<([^>]+>)(.*)(&[a-z]+)</\1\n\s*<\1/<\1\2\3;/
Confusing at first, but easier than you think!
● Works on all platforms and is built into a
lot of software (including Office)
● Ask for help! Programmers can help you
with syntax
● Let’s walk through our example which
involves matching and joining unknown
fields across multiple lines...
Regular Expression Analysis
/^\s*<([^>]+>)(.*)(&[a-z]+)</\1\n\s*<\1/<\1\2\3;/
^ Beginning of line
\s*< Zero or more whitespace characters followed by “<”
([^>]+>) One or more characters that are not “>” followed by “>” (i.e.
a tag). Store in \1
(.*) Any characters to next part of pattern. Store in \2
(&[a-z]+) Ampersand followed by letters (HTML entities). Store in \3
</\1\n “</” followed by \1 (i.e. the closing tag) followed by a newline
\s*<\1 Any number of whitespace characters followed by tag \1
/<\1\2\3;/ Replace everything up to this point with “<” followed by \1
(opening tag), \2 (field contents), \3, and “;” (fix the HTML
entity). This effectively joins the fields
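One way to actually apply a multiline pattern like this is perl in slurp mode, which reads the whole file at once so \n can match across lines. A sketch only: records.xml is a hypothetical input, and perl writes the replacement with $1, $2, $3 rather than \1, \2, \3:

perl -0777 -pe 's{^\s*<([^>]+>)(.*)(&[a-z]+)</\1\n\s*<\1}{<$1$2$3;}gm' records.xml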
The command line
● Often the easiest way by far
● Process files of any size
● Combine the power of individual programs
in a single command (pipes)
● Supported by all major platforms
Getting started with the command line
● macOS (use Terminal)
○ Install Homebrew
○ ‘brew install [package name]’
● Windows 10
○ Enable the Windows Subsystem for Linux and open a bash terminal
○ ‘sudo apt-get install [package name]’
● Or install VirtualBox with Linux
○ ‘sudo apt-get install [package name]’ from terminal
Learning the command line
● The power of pipes -- combine programs!
● Google solutions for specific problems --
there are many online examples
● Learn one command at a time. Don’t worry
about what you don’t need.
● Try, but give up fast. Ask linux geeks for
help.
Scripting is the command line!
● Simple text files that allow you to combine
utilities and programs written in any language
● No programming experience necessary
● Great for automating processes
● For unfamiliar problems, ask for help
wget
● A command line tool to retrieve data from web
servers
● Works on all operating systems
● Works with unstable connections
● Great for recursive downloads of data files
● Flexible. Can use patterns, specify depth, etc.
wget example
wget --recursive ftp://157.98.192.110/ntp-cebs/datatype/microarray/HESI/
Filezilla is good for FTP using a GUI
cURL
● A tool to transfer data from or to a server
● Works with many protocols, can deal with
authentication
● Especially useful for APIs -- the preferred way
to download data using multiple transactions
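For example, a sketch of pulling a protected file with basic authentication (the URL and credentials are hypothetical):

curl -u "myuser:mypass" -o report.csv "https://example.org/export/report.csv"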
Things that make life easier
1. JSON (JavaScript Object Notation)
2. XML (eXtensible Markup Language)
3. API (Application Programming Interface)
4. Specialized protocols
5. Using request headers to retrieve pages
that are easier to parse
There are only two kinds of data
1. Parseable
2. Unparseable
BUT
● Some structures are much easier to work
with than others
● Convert to whatever is easiest for the task
at hand
Generally speaking
● Strings
Easiest to work with, fastest, requires fewest resources,
greatest number of tools available.
● XML
Powerful but hardest to work with, slowest, requires
greatest number of resources, very inefficient for large files.
● JSON
Much more sophisticated access than strings, much easier
to work with than XML and requires fewer resources.
Awkward with certain data.
JSON example
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.json?di=04041346001043
XML example
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.xml?di=04041346001043
When processing large XML files
● Convert to JSON if possible, use string
based tools, or at least break the file into
smaller XML documents.
● DOM based tools such as XSLT must load
entire file into memory where it can take 10
times more space for processing
● If you need DOM based tools such as XSLT,
break the file into many chunks where each
record is its own document
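As one possible approach, GNU csplit can chunk a large file at each record boundary. A sketch only, assuming each record starts with a <record> tag on its own line (the element name is hypothetical):

csplit -z records.xml '/<record>/' '{*}'    # writes xx00, xx01, ... one record per file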
Using APIs
● Most common type is REST (Representational
State Transfer) -- a fancy way of saying they
work like a Web form
● Normally have to transmit credentials or other
information. cURL is very good for this
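A minimal sketch of a credentialed API call, assuming a service that takes an API key in a request header (the URL, header name, and key are hypothetical):

curl -s -H "apikey: $MY_API_KEY" "https://api.example.org/v1/records?limit=100"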
How about Linked Data?
● Uses relationships to connect data
● Great for certain types of complex data
● You must have programming skills to download
and use these
● Often can be interacted with via API
● Can be flattened and manipulated using
traditional tools
grep
● Command line utility to select lines
matching a regular expression
● Very good for extracting just the data
you’re interested in
● Use with small or very large (terabytes)
files
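For instance, a one-liner that pulls only the lines you care about from a huge log (the filenames and pattern are hypothetical):

grep -E '^2024-' huge.log > just2024.txt    # keep only lines starting with 2024-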
sed
● Command line utility to select, parse, and
transform lines
● Great for “fixing” data so that it can be
used with other programs
● Extremely powerful and works great with
very large (terabytes) files
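A small sketch of the kind of fixing sed is good for, here collapsing runs of whitespace and trimming line edges (messy.txt is hypothetical):

sed -E 's/[[:space:]]+/ /g; s/^ //; s/ $//' messy.txt > clean.txt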
tr
● Command line utility to translate individual
characters from one to another
● Great for prepping data in files too large
to load into any program
● Particularly useful in combination with sed
for fixing large delimited files containing
line breaks within the data itself
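Two typical one-character fixes, assuming hypothetical input files:

tr -d '\r' < windows.csv > unix.csv    # strip carriage returns from a Windows-made file
tr '\t' ',' < data.tsv > data.csv      # turn tabs into commas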
paste
● Command line utility that prints
corresponding lines of files side by side
● Great for combining data from large files
● Also very handy for fixing data
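For example, gluing two column files together line by line (filenames hypothetical):

paste -d',' ids.txt names.txt > combined.csv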
Delimited file with bad line feeds
{myfile.txt}
a1,a2,a3,a4,a5
,a6
b1,b2,b3,b4
,b5,b6
c1,c2,c3,c4,c5,c6
d1
,d2,d3,d4,
d5,d6
Fixed in seconds!
tr "n" "," < myfile.txt | 
sed 's/,+/,/g' | tr "," "n" | paste -s -d",,,,,n"
a1,a2,a3,a4,a5,a6
b1,b2,b3,b4,b5,b6
c1,c2,c3,c4,c5,c6
d1,d2,d3,d4,d5,d6
The power of pipes!
Command Analysis
tr "n" "," < myfile.txt | sed 's/,+/,/g' | tr "," "n" |paste -s -d",,,,,n"
tr “n” “,” < myfile.txt Convert all newlines to commas
| sed ‘/s,+/,/g’ Pipe to sed, convert all multiple instances of
commas to a single comma. Sed step is
necessary because you don’t know how
many newlines are bogus or where they are
| tr “,” “n” Pipe to tr which converts all commas into
newlines
| paste -s -d “,,,,,”n” Pipe to paste command which converts
single column file to output 6 columns wide
using a comma as a delimiter terminated by
a newline
awk
● Outstanding for reading, transforming,
and creating data in rows and columns
● Complete pattern scanning language for
text, but typically used to transform the
output of other commands
Extract 2nd and 5th fields
{myfile}
a1 a2 a3 a4 a5 a6
b1 b2 b3 b4 b5 b6
c1 c2 c3 c4 c5 c6
d1 d2 d3 d4 d5 d6
awk '{print $2,$5}' myfile
a2 a5
b2 b5
c2 c5
d2 d5
jq
● Like sed, but optimized for JSON
● Includes logical and conditional operators,
variables, functions, and powerful features
● Very good for selecting, filtering, and
formatting more complex data
JSON example
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.json?di=04041346001043
Extract deviceID if cuff detected
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.json?di=04041346001043 |
jq '.gudid.device | select(.brandName | test("cuff")) | .identifiers.identifier.deviceId'
"04041346001043"
The power of pipes!
Don’t try to remember all this!
● Ask for help -- this stuff is easy
for linux geeks
● Google can help you with
commands/syntax
● Online forums are also helpful,
but don’t mind the trolls
If you want a GUI, use OpenRefine
http://openrefine.org
● Sophisticated, including regular
expression support
● Convert between different formats
● Up to a couple hundred thousand rows
● Even has clustering capabilities!
Web Scraping Basics
Normalization is more conceptual than technical
● Every situation is unique and depends on the
data you have and what you need
● Don’t fob off data analysis on technical
people who don’t understand your data
● It’s sometimes not possible to fix everything
Solutions are often domain specific!
● Data sources
● Challenges
● Tools
● Tricks
Questions?
Kyle Banerjee
banerjek@ohsu.edu