Kyle Banerjee
banerjek@ohsu.edu
Web Scraping Basics
The truth of the matter is...
Web scraping is one of the
worst ways to get data!
What’s wrong with scraping?
1. Slow, resource-intensive, not scalable
2. Unreliable -- breaks when website
changes and works poorly with
responsive design techniques
3. Difficult to parse data
4. Harvest looks like an attack
5. Often prohibited by TOS
Before writing a scraper
Call!
● Explore better options
● Check terms of service
● Ask permission
● Can you afford scrape
errors?
Alternatives to scraping
1. Data dumps
2. API
3. Direct database connections
4. Shipping drives
5. Shared infrastructure
Many datasets are easy to retrieve
You can often export search results
Why scrape the Web?
1. Might be the only method available
2. Sometimes can get precombined or
preprocessed info that would otherwise
be hard to generate
Things to know
1. Web scraping is about parsing and
cleaning.
2. You don’t need to be a programmer, but
scripting experience is very helpful.
Don’t use Excel. Seriously.
Excel
● Mangles your data
○ Identifiers and numeric data at risk
● Cannot handle carriage returns in data
● Crashes with large files
● OpenRefine is a better tool for situations
where you think you need Excel
http://openrefine.org
Harvesting options
● Free utilities
● Purchased software
● DaaS (Data as a Service) -- hosted web
spidering
● Write your own
Watch out for spider traps!
● Web pages that intentionally or
unintentionally cause a crawler to make
an infinite number of requests
● No algorithm can detect all spider traps
Ask for help!
1. Methods described here are familiar to
almost all systems people
2. Domain experts can help you identify tools
and shortcuts that are especially relevant
to you
3. Bouncing ideas off *anyone* usually results
in a superior outcome
Handy skills
Skill Benefit
DOM Identify and extract data
Regular expressions Identify and extract data
Command line Process large files
Scripting Automate repetitive tasks; perform complex operations
Handy basic tools
Tool Benefit
Web scraping service Simplify data acquisition
cURL (command line) Easily retrieve data using APIs
wget (command line) Recursively retrieve web pages
OpenRefine Process and clean data
Power tools
Tool Benefit
grep, sed, awk, tr, paste Select and transform data in VERY large files quickly
jq Easily manipulate JSON
xml2json Convert XML to JSON
csvkit Utilities to convert to and work with CSV
scrape HTML extraction using XPath and CSS selectors
Web scraping, the easy way
● Hosted services allow you to easily target
specific structures and pages
● Programming experience unnecessary, but
helpful
● For unfamiliar problems, ask for help
Hosted example, Scrapinghub
Scrapinghub data output
Document Object Model (DOM)
● Programming interface for HTML and XML
documents
● Supported by many languages/environments
● Represents documents in a tree structure
● Used to directly access content
Document Object Model (DOM) Tree
/document/html/body/div/p = “text node”
XPath is a syntax for defining
parts of an XML document
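As a quick illustration, XPath can be run straight from the command line with xmllint (part of libxml2). This is a minimal sketch; page.html is a hypothetical saved page, and the path mirrors the tree above:

xmllint --html --xpath '/html/body/div/p/text()' page.html

This prints the text node directly, no programming required.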
The Swiss Army Knife of data
Regular Expressions
● Special strings that allow you to search
and replace based on patterns
● Supported in a wide variety of software
and all operating systems
Regular expressions can...
● Use logic, capitalization, edges of
words/lines, express ranges, use bits (or
all) of what you matched in replacements
● Convert free text into XML, delimited
text, or codes, and vice versa
● Find complex patterns using proximity
indicators and/or involving multiple lines
● Select preferred versions of fields
Quick Regular Expression Guide
^ Match the start of the line
$ Match the end of the line
. Match any single character
* Match zero or more of the previous character
[A-DG-J0-5]* Match zero or more of ABCDGHIJ012345 (a comma inside brackets would be matched literally, so leave it out)
[^A-C] Match any one character that is NOT A,B, or C
(dog)
Match the word "dog", including case, and remember that text
to be used later in the match or replacement
\1
Insert the first remembered text as if it were typed here (\2 for
second, \3 for 3rd, etc.)
\
Use to match special characters. \\ matches a backslash, \*
matches an asterisk, etc.
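To see a couple of these in action, here is a minimal sketch using sed and grep (codes.txt is a hypothetical file):

echo "dog dog" | sed -E 's/(dog) \1/\1/'    # backreference collapses the repeat, printing "dog"
grep -E '^[A-DG-J0-5]+$' codes.txt          # keeps lines made only of ABCDGHIJ012345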
Data can contain weird problems
● XML metadata contained errors on every
field that contained an HTML entity (&amp;
&lt; &gt; &quot; &apos; etc.)
<b>Oregon Health &amp</b>
<b> Science University</b>
● Error occurs in many fields scattered across
thousands of records
● But this can be fixed in seconds!
Regular expressions to the rescue!
● “Whenever a field ends in an HTML entity
minus the semicolon and is followed by an
identical field, join those into a single field and
fix the entity. Any line can begin with an
unknown number of tabs or spaces”
/^\s*<([^>]+>)(.*)(&[a-z]+)</\1\n\s*<\1/<\1\2\3;/
Confusing at first, but easier than you think!
● Works on all platforms and is built into a
lot of software (including Office)
● Ask for help! Programmers can help you
with syntax
● Let’s walk through our example which
involves matching and joining unknown
fields across multiple lines...
Regular Expression Analysis
/^\s*<([^>]+>)(.*)(&[a-z]+)</\1\n\s*<\1/<\1\2\3;/
^ Beginning of line
\s*< Zero or more whitespace characters followed by “<”
([^>]+>) One or more characters that are not “>” followed by “>” (i.e.
a tag). Store in \1
(.*) Any characters to next part of pattern. Store in \2
(&[a-z]+) Ampersand followed by letters (HTML entities). Store in \3
</\1\n “</” followed by \1 (i.e. the closing tag) followed by a newline
\s*<\1 Any number of whitespace characters followed by tag \1
/<\1\2\3;/ Replace everything up to this point with “<” followed by \1
(opening tag), \2 (field contents), \3, and “;” (fix the HTML
entity). This effectively joins the fields
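One way to actually apply a multiline pattern like this is perl in slurp mode, which reads the whole file at once so \n can match across lines. A sketch only: records.xml is a hypothetical input, and perl writes the replacement with $1, $2, $3 rather than \1, \2, \3:

perl -0777 -pe 's{^\s*<([^>]+>)(.*)(&[a-z]+)</\1\n\s*<\1}{<$1$2$3;}gm' records.xml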
The command line
● Often the easiest way by far
● Process files of any size
● Combine the power of individual programs
in a single command (pipes)
● Supported by all major platforms
Getting started with the command line
● macOS (use Terminal)
○ Install Homebrew
○ ‘brew install [package name]’
● Windows 10
○ Enable the Windows Subsystem for Linux and open a bash terminal
○ ‘sudo apt-get install [package name]’
● Or install VirtualBox with Linux
○ ‘sudo apt-get install [package name]’ from terminal
Learning the command line
● The power of pipes -- combine programs!
● Google solutions for specific problems --
there are many online examples
● Learn one command at a time. Don’t worry
about what you don’t need.
● Try, but give up fast. Ask linux geeks for
help.
Scripting is the command line!
● Simple text files that allow you to combine
utilities and programs written in any language
● No programming experience necessary
● Great for automating processes
● For unfamiliar problems, ask for help
wget
● A command line tool to retrieve data from web
servers
● Works on all operating systems
● Works with unstable connections
● Great for recursive downloads of data files
● Flexible. Can use patterns, specify depth, etc.
wget example
wget --recursive ftp://157.98.192.110/ntp-cebs/datatype/microarray/HESI/
Filezilla is good for FTP using a GUI
cURL
● A tool to transfer data from or to a server
● Works with many protocols, can deal with
authentication
● Especially useful for APIs -- the preferred way
to download data using multiple transactions
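For example, a sketch of pulling a protected file with basic authentication (the URL and credentials are hypothetical):

curl -u "myuser:mypass" -o report.csv "https://example.org/export/report.csv"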
Things that make life easier
1. JSON (JavaScript Object Notation)
2. XML (eXtensible Markup Language)
3. API (Application Programming Interface)
4. Specialized protocols
5. Using request headers to retrieve pages
that are easier to parse
There are only two kinds of data
1. Parseable
2. Unparseable
BUT
● Some structures are much easier to work
with than others
● Convert to whatever is easiest for the task
at hand
Generally speaking
● Strings
Easiest to work with, fastest, requires fewest resources,
greatest number of tools available.
● XML
Powerful but hardest to work with, slowest, requires
greatest number of resources, very inefficient for large files.
● JSON
Much more sophisticated access than strings, much easier
to work with than XML and requires fewer resources.
Awkward with certain data.
JSON example
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.json?di=04041346001043
XML example
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.xml?di=04041346001043
When processing large XML files
● Convert to JSON if possible, use string
based tools, or at least break the file into
smaller XML documents.
● DOM based tools such as XSLT must load
entire file into memory where it can take 10
times more space for processing
● If you need DOM based tools such as XSLT,
break the file into many chunks where each
record is its own document
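As one possible approach, GNU csplit can chunk a large file at each record boundary. A sketch only, assuming each record starts with a <record> tag on its own line (the element name is hypothetical):

csplit -z records.xml '/<record>/' '{*}'    # writes xx00, xx01, ... one record per file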
Using APIs
● Most common type is REST (Representational
State Transfer) -- a fancy way of saying they
work like a Web form
● Normally have to transmit credentials or other
information. cURL is very good for this
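A minimal sketch of a credentialed API call, assuming a service that takes an API key in a request header (the URL, header name, and key are hypothetical):

curl -s -H "apikey: $MY_API_KEY" "https://api.example.org/v1/records?limit=100"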
How about Linked Data?
● Uses relationships to connect data
● Great for certain types of complex data
● You must have programming skills to download
and use these
● Often can be interacted with via API
● Can be flattened and manipulated using
traditional tools
grep
● Command line utility to select lines
matching a regular expression
● Very good for extracting just the data
you’re interested in
● Use with small or very large (terabytes)
files
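For instance, a one-liner that pulls only the lines you care about from a huge log (the filenames and pattern are hypothetical):

grep -E '^2024-' huge.log > just2024.txt    # keep only lines starting with 2024-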
sed
● Command line utility to select, parse, and
transform lines
● Great for “fixing” data so that it can be
used with other programs
● Extremely powerful and works great with
very large (terabytes) files
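A small sketch of the kind of fixing sed is good for, here collapsing runs of whitespace and trimming line edges (messy.txt is hypothetical):

sed -E 's/[[:space:]]+/ /g; s/^ //; s/ $//' messy.txt > clean.txt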
tr
● Command line utility to translate individual
characters from one to another
● Great for prepping data in files too large
to load into any program
● Particularly useful in combination with sed
for fixing large delimited files containing
line breaks within the data itself
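Two typical one-character fixes, assuming hypothetical input files:

tr -d '\r' < windows.csv > unix.csv    # strip carriage returns from a Windows-made file
tr '\t' ',' < data.tsv > data.csv      # turn tabs into commas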
paste
● Command line utility that prints
corresponding lines of files side by side
● Great for combining data from large files
● Also very handy for fixing data
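For example, gluing two column files together line by line (filenames hypothetical):

paste -d',' ids.txt names.txt > combined.csv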
Delimited file with bad line feeds
{myfile.txt}
a1,a2,a3,a4,a5
,a6
b1,b2,b3,b4
,b5,b6
c1,c2,c3,c4,c5,c6
d1
,d2,d3,d4,
d5,d6
Fixed in seconds!
tr "n" "," < myfile.txt | 
sed 's/,+/,/g' | tr "," "n" | paste -s -d",,,,,n"
a1,a2,a3,a4,a5,a6
b1,b2,b3,b4,b5,b6
c1,c2,c3,c4,c5,c6
d1,d2,d3,d4,d5,d6
The power of pipes!
Command Analysis
tr "n" "," < myfile.txt | sed 's/,+/,/g' | tr "," "n" |paste -s -d",,,,,n"
tr “n” “,” < myfile.txt Convert all newlines to commas
| sed ‘/s,+/,/g’ Pipe to sed, convert all multiple instances of
commas to a single comma. Sed step is
necessary because you don’t know how
many newlines are bogus or where they are
| tr “,” “n” Pipe to tr which converts all commas into
newlines
| paste -s -d “,,,,,”n” Pipe to paste command which converts
single column file to output 6 columns wide
using a comma as a delimiter terminated by
a newline
awk
● Outstanding for reading, transforming,
and creating data in rows and columns
● Complete pattern scanning language for
text, but typically used to transform the
output of other commands
Extract 2nd and 5th fields
{myfile}
a1 a2 a3 a4 a5 a6
b1 b2 b3 b4 b5 b6
c1 c2 c3 c4 c5 c6
d1 d2 d3 d4 d5 d6
awk '{print $2,$5}' myfile
a2 a5
b2 b5
c2 c5
d2 d5
jq
● Like sed, but optimized for JSON
● Includes logical and conditional operators,
variables, functions, and powerful features
● Very good for selecting, filtering, and
formatting more complex data
JSON example
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.json?di=04041346001043
Extract deviceID if cuff detected
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.json?di=04041346001043 |
jq '.gudid.device | select(.brandName | test("cuff")) | .identifiers.identifier.deviceId'
"04041346001043"
The power of pipes!
Don’t try to remember all this!
● Ask for help -- this stuff is easy
for linux geeks
● Google can help you with
commands/syntax
● Online forums are also helpful,
but don’t mind the trolls
If you want a GUI, use OpenRefine
http://openrefine.org
● Sophisticated, including regular
expression support
● Convert between different formats
● Up to a couple hundred thousand rows
● Even has clustering capabilities!
Web Scraping Basics
Normalization is more conceptual than technical
● Every situation is unique and depends on the
data you have and what you need
● Don’t fob off data analysis on technical
people who don’t understand your data
● It’s sometimes not possible to fix everything
Solutions are often domain specific!
● Data sources
● Challenges
● Tools
● Tricks
Questions?
Kyle Banerjee
banerjek@ohsu.edu