Adrian Stevenson, Senior Technical Coordinator, Jisc Manchester
Tools for Data Manipulation
UKAD Open Refine Workshop, Jisc London, 18th March 2016
Tools for Data Manipulation - Workshop resources at http://data.archiveshub.ac.uk/workshops/ukad2016/
Workshop Resources
Available from:
http://data.archiveshub.ac.uk/workshops/ukad2016/readme.html
Link to Open Refine and plugins
Link to example data used for workshop
Link to completed Open Refine project from today's workshop
Open Refine
OpenRefine (formerly Google Refine) is a powerful tool for
working with messy data: cleaning it; transforming it from
one format into another; and extending it with web
services and external data.
Main Uses:
• Explore data
• Clean and transform data
• Reconcile and match data
Installing and running Open Refine
Download from:
http://openrefine.org/download.html
Run and in a web browser go to: http://127.0.0.1:3333/
Select ‘create project’ and browse for Archives Hub
example csv data file
Note: May need to clear browser cache to see new projects
Clean and Transform - Facets and Clustering
Strip white space
Transform Upper case, title case
Split multi valued cells or Edit col > Split several cols
Facet on label
Order by count
Cluster and rename rows
Undo
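As a rough illustration of what clustering does with near-identical labels, the sketch below approximates key-collision clustering with a fingerprint key. It is written in Python rather than taken from Open Refine itself, so treat it as an approximation; the function names and sample labels are made up for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key: normalise, tokenise, sort, de-duplicate."""
    # Strip surrounding white space and lower-case the value.
    text = value.strip().lower()
    # Decompose accented characters and drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, then split on white space.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; return only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical subject labels with inconsistent spacing, case and word order.
labels = ["Social reform", " social  reform", "Reform, Social", "Socialism"]
clusters = cluster(labels)
print(clusters)
```

The three variants of "social reform" share one key and would be offered as a single cluster to merge and rename, while "Socialism" is left alone.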
Clean - Remove Duplicate rows
Sort on column with duplicates and reorder permanently
Facet duplicates to check
Watch for Open Refine switching from rows to records view
Edit cells > Blank Down
Facet by blank
Remove all matching
Essence of Open Refine is using facets and filters to isolate
rows and invoke commands to affect all these rows together
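The sort, blank down, facet by blank and remove sequence can be sketched in Python as below. This is only a model of the steps on a list of dictionaries, not how Open Refine implements them; the column names and sample rows are hypothetical.

```python
def blank_down(rows, column):
    """Blank a cell when it repeats the value in the row directly above."""
    result = []
    previous = None
    for row in rows:
        current = row[column]
        new_row = dict(row)  # copy so the input rows are not modified
        if current == previous:
            new_row[column] = ""
        result.append(new_row)
        previous = current
    return result


def remove_duplicates(rows, column):
    """Sort, blank down, then drop the rows whose cell is now blank."""
    # Sorting puts duplicates next to each other (the permanent reorder).
    ordered = sorted(rows, key=lambda row: row[column])
    blanked = blank_down(ordered, column)
    # Equivalent of faceting by blank and removing all matching rows.
    # Note: rows that were blank to begin with are removed as well.
    return [row for row in blanked if row[column] != ""]


# Hypothetical example data: one person linked to two archival resources.
rows = [
    {"person": "Webb, Beatrice", "resource": "A"},
    {"person": "Shaw, George Bernard", "resource": "B"},
    {"person": "Webb, Beatrice", "resource": "C"},
]
deduped = remove_duplicates(rows, "person")
print(deduped)
```

Only the first row for each person survives, so the link to resource "C" is lost. That is why it is worth checking the duplicates facet before removing anything.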
URIs
LD Design Issues
Triples
http://www.w3.org/DesignIssues/LinkedData.html
Triples
Triple statements
»‘Things’ have ‘properties’ with ‘values’
»Subject – Predicate – Object
Repository – Provides Access To – Archival Resource
Jane Austen – Is Author Of – Pride and Prejudice
Triples are the basis of RDF and Linked Data
owl:sameAs
Hub Person - owl:sameAs - VIAF Person
<http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer>
owl:sameAs
<http://viaf.org/viaf/86607236> .
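To make the owl:sameAs statement concrete, here is a minimal Python sketch that holds a triple as subject, predicate and object and writes it out as one N-Triples line. The `Triple` class and `to_ntriples` helper are illustrative names, not part of any workshop tooling; the two person URIs are the ones on the slide, and the full owl:sameAs URI is the standard OWL one.

```python
from typing import NamedTuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


class Triple(NamedTuple):
    subject: str    # URI of the thing being described
    predicate: str  # URI of the property
    obj: str        # URI of the value


def to_ntriples(triple: Triple) -> str:
    """Serialise a triple whose three parts are all URIs as one N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


hub_person = (
    "http://data.archiveshub.ac.uk/id/person/nra/"
    "webbmarthabeatrice1858-1943socialreformer"
)
viaf_person = "http://viaf.org/viaf/86607236"

line = to_ntriples(Triple(hub_person, OWL_SAME_AS, viaf_person))
print(line)
```

The helper only covers triples where all three parts are URIs; literal values such as names or dates would need quoting and are left out of this sketch.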
Matching Names to VIAF
May need to join columns together, for example to give a more
consistent name form, e.g. using:
cells["FamilyName"].value + ", " + cells["GivenName"].value + ", " + cells["Dates"].value
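The same join can be modelled in Python to show one way of coping with blank cells, which the editor's notes flag as a problem for the merge. This is a sketch under that assumption, not a translation of the GREL expression: unlike the expression above, it skips blank parts instead of failing, and the function name is hypothetical.

```python
def join_name(family_name, given_name, dates, separator=", "):
    """Join the non-blank name parts into one consistent name form."""
    parts = [family_name, given_name, dates]
    # Treat None and white-space-only cells as blank and leave them out.
    cleaned = [part.strip() for part in parts if part and part.strip()]
    return separator.join(cleaned)


full = join_name("Webb", "Martha Beatrice", "1858-1943")
no_dates = join_name("Webb", "Martha Beatrice", None)
print(full)
print(no_dates)
```

Skipping blanks keeps the number of commas in step with the number of parts actually present, which may matter when the joined name is sent to a reconciliation service.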
Matching Names to VIAF
VIAF reconciliation service details at:
http://iphylo.blogspot.co.uk/2013/04/reconciling-author-names-using-open.html
May need to add as a ‘standard service’ under Reconcile >
Start reconciling. Service URL is:
http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php
Other reconciliation services, e.g. LCSH, at:
https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
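For a sense of what Open Refine sends to a standard reconciliation service, the Python sketch below builds a request URL carrying a batch of queries in a `queries` parameter, following the general reconciliation API convention. It only constructs and decodes the URL and makes no network call. Whether this particular service is still running, or accepts exactly this form, is an assumption, and `build_reconcile_url` is a made-up helper name.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

SERVICE_URL = "http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php"


def build_reconcile_url(service_url, names, limit=3):
    """Build a GET URL carrying a batch of reconciliation queries."""
    # Each query gets a key (q0, q1, ...) so results can be matched back to rows.
    queries = {
        f"q{index}": {"query": name, "limit": limit}
        for index, name in enumerate(names)
    }
    return service_url + "?" + urlencode({"queries": json.dumps(queries)})


url = build_reconcile_url(SERVICE_URL, ["Webb, Martha Beatrice, 1858-1943"])

# Decode the URL again to show the payload the service would receive.
decoded = json.loads(parse_qs(urlparse(url).query)["queries"][0])
print(decoded)
```

In the workshop itself Open Refine handles this exchange; the sketch is only meant to show why a consistent name form in the query string matters.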
RDF Export
Download RDF Refine Extension from http://refine.deri.ie/
Unzip
Open Project > Browse workspace directory
Create ‘extensions’ folder (if it doesn’t exist)
Copy the unzipped RDF Refine folder into the ‘extensions’ folder in the workspace directory
Restart Open Refine
Need to create column with VIAF URIs for export:
"http://viaf.org/viaf/"+cell.recon.match.id
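The VIAF URI column built for the RDF export can be sketched in Python as follows. It assumes each row already carries the identifier of its accepted reconciliation match, or nothing when no match was accepted; the row layout, the `match_id` key and the unmatched example row are hypothetical.

```python
VIAF_BASE = "http://viaf.org/viaf/"


def viaf_uri(match_id):
    """Return the VIAF URI for a matched identifier, or None when unmatched."""
    if not match_id:
        return None
    return VIAF_BASE + str(match_id)


rows = [
    {"name": "Webb, Martha Beatrice, 1858-1943", "match_id": "86607236"},
    {"name": "Unmatched, Person", "match_id": None},  # hypothetical unmatched row
]
for row in rows:
    # New column holding the URI that the export can link to.
    row["viaf_uri"] = viaf_uri(row["match_id"])

print(rows)
```

Leaving unmatched rows empty, rather than emitting a bare base URI, avoids exporting links that point nowhere.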
Matching Subjects to LCSH
Click RDF button in the top right corner, select ‘Add reconciliation
service, Based on SPARQL endpoint’.
Add following parameters:
Name: LCSH
Endpoint URL: http://sparql.freeyourmetadata.org/
Graph URI: http://id.loc.gov/authorities/subjects
Type: Virtuoso
Label properties: check only skos:prefLabel
Martha Beatrice Webb
Place of birth: Gloucester, England
Place of death: Liphook, Hampshire, England
Life dates: 1858-1943
Epithet: social reformer and historian
Family name: Webb
Image
Biographical Notes
from: Beatrice Webb letters
Beatrice Webb (1858 - 1943). Fabian Socialist, social reformer, writer,
historian, diarist. Wife, collaborator and assistant of Sidney Webb,
later Lord Passfield. Together they contributed to the radical
ideology first of the Liberal Party and later of the Labour Party.
from: Beatrice Webb, A summer holiday in Scotland, 1884.
Beatrice Webb (1858-1943), nee Potter, social reformer and diarist.
Married to Sidney Webb, pioneers of social science. She was
involved in many spheres of political and social activity including the
Labour Party, Fabianism, social observation, investigations into
poverty, development of socialism, the foundation of the National
Health Service and post war welfare state, the London School of
Works
Our Partnership
My Apprenticeship
The case for the factory acts
Beatrice Webb’s diaries; edited by Margaret Cole
The Diary
Knows
http://dbpedia.org/page/George_Bernard_Shaw
http://dbpedia.org/page/Sidney_Webb,_1st_Baron_Passfield
Contact
Adrian Stevenson
Senior Technical Coordinator
Jisc Manchester
http://www.jisc.ac.uk
adrian.stevenson@jisc.ac.uk
http://www.twitter.com/adrianstevenson
https://www.linkedin.com/in/adrianstevenson
CC License
This presentation is available under a Creative Commons Non
Commercial-Share Alike licence:
http://creativecommons.org/licenses/by-nc/2.0/uk/

Editor's Notes

  • #4 Hub used mainly for linked data project where we wanted to match to VIAF. Will come to later in the workshop.
  • #5 Review options on import screen. Talk through the example data and the purpose of the columns.
  • #6 Facet
  • #7 Mention that a facet on duplicates for person URI doesn’t necessarily mean you want to remove the rows, as the Arc Res URIs may be different. It depends on what you want to do. More tutorials: http://kb.refinepro.com/2011/08/remove-duplicate.html http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial#Deduplicate_entries
  • #8 Explain why might want to reconcile to VIAF. Other recon services at https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
  • #9 http://www.w3.org/DesignIssues/LinkedData.html
  • #12 If any of the cells in the columns are blank, the merge will fail for that row. To fix, create a facet of blank cells with "Text Facet" ⇒ "Customized Facets" ⇒ "Facet by Blank". Then use "Edit Cells" ⇒ "Transform ..." and enter a string with a space: ' '. This also has its limitations, as some names have an inconsistent number of commas.
  • #13 Talk through faceting of judgement. How to check and accept reconciled rows. Explain that this is why the Hub URI and ArcRes URI have been included: for manual checking.
  • #16 Mock-up of the Linking Lives interface shows the way data is brought together.