Acquiring Data
Data Science for Beginners, Session 3
Session 3: your 5-7 things
• Finding development data
• Data filetypes
• Using an API
• PDF scrapers
• Web Scrapers
• Getting data ready for science
Finding development data
Data
• Data files (CSV, Excel, JSON, XML...)
• Databases (sqlite, mysql, oracle, postgresql...)
• APIs
• Report tables (tables on websites, in pdf reports...)
• Text (reports and other documents…)
• Maps and GIS data (openstreetmap, shapefiles, NASA earth images...)
• Images (satellite images, drone footage, pictures, videos…)
Data Sources
• data warehouses and catalogues
• open government data
• NGO websites
• web searches
• online documents, images, maps etc
• people you know who might have data
Creating your own data: People
Creating your own data: Sensors
Be cynical about your data
• Is the data relevant to your problem?
• Where did this data come from?
– Who collected it?
– Why? What for?
– Do they have biases that might show up in the data?
• Are there holes in the data (demographic, geographical, political etc)?
• Do you have supporting data? Is it *really* from a different source?
Data filetypes
Some Data Types
• Structured data:
– Tables (e.g. CSVs, Excel tables)
– Relational data (e.g. json, xml, sqlite)
• Unstructured data:
– Free-text (e.g. Tweets, webpages etc)
• Maps and images:
– Vector data (e.g. shapefiles)
– Raster data (e.g. GeoTIFFs)
– Images
CSVs
• Comma-separated values
• Lots of commas
• Sometimes tab-separated (TSVs)
• Most applications read CSVs
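A minimal sketch of reading a CSV with Python's built-in csv module; the two-row dataset here is made up for illustration, and a real file object works the same way as the string:

```python
import csv
import io

# A tiny CSV held in a string; replace io.StringIO(text) with open("file.csv")
text = "country,rural_pop_pct\nKenya,74.2\nChad,77.6\n"
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["country"])        # Kenya
print(rows[1]["rural_pop_pct"])  # 77.6
```

Note that csv gives you every value back as a string; converting "77.6" to a number is up to you.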
JSON
• JavaScript Object Notation
• Lots of braces { }
• Structured, i.e. not always row-by-column
• Many APIs output JSON
• Not all applications read JSON
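You can see the "not always row-by-column" structure by parsing a small JSON string with Python's json module (the record and its values below are invented for illustration):

```python
import json

# A small JSON record: nested, so it doesn't map directly onto rows and columns
text = '{"country": "Kenya", "rural_pop_pct": {"2000": 80.1, "2015": 74.2}}'
record = json.loads(text)
print(record["country"])                # Kenya
print(record["rural_pop_pct"]["2015"])  # 74.2
```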
XML
• eXtensible Markup Language
• Lots of brackets < >
• Structured, i.e. not always row-by-column
• Some applications read XML
• HTML is closely related to XML (XHTML is an XML form of HTML)
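The standard library can parse XML too; a sketch of the same kind of nested record as XML, with made-up element names and values:

```python
import xml.etree.ElementTree as ET

# Invented XML for illustration: one country element with one indicator value
text = """<countries>
  <country name="Kenya">
    <indicator year="2015">74.2</indicator>
  </country>
</countries>"""
root = ET.fromstring(text)
for country in root.findall("country"):
    ind = country.find("indicator")
    print(country.get("name"), ind.get("year"), ind.text)  # Kenya 2015 74.2
```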
Using an API
APIs
• “Application Programming Interface”
• A way for one computer application to ask
another one for a service
–Usually “give me this data”
–Sometimes “add this to your datasets”
RESTful APIs
http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=csv
• Base URL: api.worldbank.org
• What you’re asking for:
countries/all/indicators/SP.RUR.TOTL.ZS
• Details: date=2000:2015, format=csv
Using cURL on the command line:
curl -X GET <URL>
Do this: try these URLs
• http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=csv
• http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=json
• http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=xml
The Python Requests library

import requests
import json

worldbank_url = "http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015&format=json"
r = requests.get(worldbank_url)
jsondata = json.loads(r.text)
print(jsondata[1])
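To keep the data an API returned, write it straight back out to a file with json.dump. A sketch, using a small stand-in record in place of the live World Bank response:

```python
import json

# Stand-in for the jsondata returned by the API call above
jsondata = [{"page": 1, "pages": 1},
            [{"country": "Kenya", "value": 74.2}]]

# Save it...
with open("mynewdata.json", "w") as fout:
    json.dump(jsondata, fout)

# ...and read it back to check the round trip
with open("mynewdata.json") as fin:
    reloaded = json.load(fin)
print(reloaded == jsondata)  # True
```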
Request errors
Check r.status_code for the result:
• 200: okay
• 400: bad request
• 401: unauthorised
• 404: page not found
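The standard reason phrases for these codes are available in Python's standard library, which is handy when logging failed requests:

```python
from http import HTTPStatus

# Print the official phrase for each of the codes above
for code in (200, 400, 401, 404):
    print(code, HTTPStatus(code).phrase)
# 200 OK / 400 Bad Request / 401 Unauthorized / 404 Not Found
```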
Requests with a password
import requests
r = requests.get('https://api.github.com/user',
auth=('yourgithubname', 'yourgithubpassword'))
dataset = r.text
• Note: GitHub now expects a personal access token in place of your account password
PDF Scrapers
Scraping
• Data in files and webpages that’s easy for
humans to read, but difficult for machines
• Don’t scrape unless you have to
–Small dataset: type it in!
–Larger dataset: Look for datasets and APIs online
Development data is often in PDFs
Some PDFs can be Scraped
• Open the PDF file in Acrobat
• Can you cut-and-paste text in the file?
–Y: use a PDF scraper
–N: the text is stored as an image, so you'll need OCR before you can scrape it
PDF Table Scrapers
• Cut and paste to Excel
• Tabula: free, open source, offline
• Pdftables: not free, online
• CometDocs: free, online
Web Scrapers
Web Scraping
Design First!
What do you need to scrape?
● Which data values
● From which formats (html table, excel, pdf etc)
Do you need to maintain this?
● Is the dataset regularly updated, or is once enough?
● How will you make updated data available to other people?
● Who could edit your code next year (if needed)?
Using Google Spreadsheets
• Open a google spreadsheet
• Put this into cell A1:
=importHtml("http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population", "table", 2)
Web scraping in Python
● Webpage-grabbing libraries:
o requests
o mechanize
o cookielib (http.cookiejar in Python 3)
● Element-finding libraries:
o beautifulsoup
Unpicking HTML with Python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
html = requests.get(url)
bsObj = BeautifulSoup(html.text, "html.parser")
tables = bsObj.find_all('table')
tables[0].find("th")
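The same find_all/find calls work on any HTML string, so you can test your element-finding offline before pointing it at a live page; the snippet below is invented for illustration:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML table standing in for a fetched webpage
html_text = """
<table>
  <tr><th>State</th><th>Population</th></tr>
  <tr><td>California</td><td>39,000,000</td></tr>
</table>
"""
bsObj = BeautifulSoup(html_text, "html.parser")
table = bsObj.find_all("table")[0]
print(table.find("th").get_text())  # State
```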
Getting data ready for science
Changing Data Formats
• Conversion websites
• Code:
import pandas as pd
df = pd.read_json("myfilename1.json")
df.to_csv("myfilename2.csv")
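If pandas isn't to hand, the same JSON-to-CSV conversion can be sketched with the standard library, assuming the data is a flat list of records that share the same keys (the records and filename below are stand-ins):

```python
import csv
import json

# Stand-in records; assume every record has the same keys
records = json.loads('[{"country": "Kenya", "value": 74.2},'
                     ' {"country": "Chad", "value": 77.6}]')

with open("myfilename2.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
```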
Normalising data
Books
• "Web Scraping with Python: Collecting Data from the
Modern Web", O'Reilly
Exercises
Prepare for next week
• Install Tableau
–See install instructions file
Prepare data
• Use your problem statement to look for datasets - what do
you need to answer your questions?
• If you can, convert your data into normalised CSV files
• Think about your data gaps - how can you fill them?
Editor's Notes
• #2: Today we're looking at the types of data that are hiding online, and how to bring them out of hiding and into your data science code.
• #3: So let's begin. Here are the six things we'll talk about today.
• #4: Your first problem is finding the data to help answer your questions.
• #5: A quick recap: these are some of the places where you can find data. Some of them are harder to process than others, but they all contain data.
• #6: And here are some places to find them - there's a longer list in the references folder.
• #7: Development data isn't always easy to obtain: you might have to create your own, by asking people to contribute information through crowdsourcing, in-person surveys, mobile surveys etc.
• #8: You might also need to generate data for your problem by using sensors.
• #9: Selection bias = non-random selection of individuals. One example of this is pothole reporting: potholes are generally reported more in affluent areas, by people who have both the smartphone apps and the time and energy to report. Missing data = data that you don't have. You need to be aware of this, and take account of it. If you need more persuading, read about Wald and the bullet-hole problem.
• #10: There are many datafile types - here's a guide to some of them.
• #11: Tables typically have rows and columns; relational data is typically hierarchical, i.e. it can't be easily converted into row-column form.
• #12: CSVs are the workhorse of datatypes: almost every data application can read them in.
• #13: Converting JSON to CSV: use a conversion website (e.g. http://www.convertcsv.com/json-to-csv.htm) or write some Python code.
• #14: Converting XML to CSV: use a conversion website (e.g. http://www.convertcsv.com/xml-to-csv.htm) or write code.
• #15: One way to obtain data is through an application programming interface (API).
• #16: More about open APIs: https://en.wikipedia.org/wiki/Open_API
• #17-18: REST = Representational State Transfer; a human-readable way to ask APIs for information. At the top is a RESTful URL (web address); you can type this directly into an internet browser to get a datafile. This address has three parts: the base URL, api.worldbank.org; a description of what you're looking for - in this case, the total rural population for all countries in the world; and some more details, including filters (only data between 2000 and 2015) and data formats. Try this address, and try "&format=json" instead of "&format=csv" at the end.
• #20: The Python requests library is useful for calling APIs from a Python program (e.g. so you can then use or save the information returned from them). If anything goes wrong, try r.status_code. If you're wondering how to get this JSON data into a file: import json; fout = open('mynewdata.json', 'w'); json.dump(jsondata, fout)
• #21: See https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
• #24: Here are places to look first: the website the data's in, for file copies of the data; the same website, for an API (http://api.theirsitename.com/, http://theirsitename.com/api, or Google "site:theirsitename.com api"); related sites, for file copies and APIs; community warehouses (scraperwiki.com, datahub.io etc.), for other people's scrapers.
• #25: Big PDFs. And we'll need to get the data out of them. This is where PDF scrapers come in.
• #29: Web scraping is the process of extracting data from webpages. If you open a webpage (e.g. https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population) and click on "view source", you'll see the view that a computer has of that page. This is where the data is hiding…
• #31: The pattern for this is: =importHtml("your-weburl", "table", yourtablenumber). More: www.mulinblog.com/basic-web-scraping-data-visualization-using-google-spreadsheets/
• #32: You've already used the requests library to grab data from the web; mechanize and cookielib help with things like form submissions and cookie handling.
• #34 (also #38): Your exercises were all built into the class. But if you want more…
• #35 (also #37): Most data science and visualisation programs can read CSV data, so if you can easily convert data to that, good. There are websites that will convert to CSV; you can also do this by reading data in one format and writing it out in another. The pandas library is very helpful for this, if the data is row-column.
• #36: We'll cover data cleaning later, but if you want to try next week's visualisation techniques on your own data, it will need to at least be normalised. Here's what we mean by this (and Tableau has a tool for doing this: see http://kb.tableau.com/articles/knowledgebase/denormalize-data).