Hand me the data!
What you should know as a humanities researcher
before asking for data from a web archive
Hello!
About Me
• A humanities background
• MA Information Studies, Aarhus University + some
extra years studying Software Construction
• Interest in web technology and web design since
the Netscape era
• On the NetLab team for many years, I try to understand our
national web archive in detail and help researchers explore
and make use of it as a scholarly source
• My programming languages in order of use: R, Python, PHP,
JavaScript, Java
What this presentation is about
1. What is web archive data?
2. How do you get it? (Data Delivery)
3. How do you handle it? (Data Management)
4. Do you need to be a computer scientist?
What is web archive data?
Index
From a user interface to a data exploration/programming environment
Case: mHealth Project
• The aim of this project is to provide an overview of
how mHealth has been discussed on Danish
webpages over the last 10 years
• mHealth is a term used for mobile health
technologies; it is a constantly evolving term
covering health apps, healthcare technology etc.
More info: http://www.netlab.dk/research/it-developer-projects/mhealth-in-denmark-findings-from-the-web-archive-2/
Case: mHealth Project
• Full-text search in the web archive returned many
thousands of pages (in both Danish and English)
• No easy way to determine relevant content by
browsing
Wednesday 6 May: Hand me the data! What you should know as a humanities researcher before asking for data from a web archive, Ulrich Have, NetLab/DIGHUMLAB, Aarhus University
Data Delivery
• What kind of data types are in the index?
• What kind of data do we need?
• ETL: Links, text and metadata
• CSV, no WARC files
• Medium size
• Quick delivery
Data Delivery
• ETL: Extract-Transform-Load
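An ETL specification of this kind pairs a query that defines the corpus with a list of data fields to deliver. The Python sketch below illustrates the idea on three made-up records. The `Document` class, the field names, the language code `da` and the domain `twitter.com` are assumptions for illustration only; they are not the archive's actual schema or the project's own code.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical shape of one indexed page (not the archive's real schema)."""
    url: str
    domain: str
    language: str
    year: int
    title: str
    text: str


# Assumed metadata fields that the ETL document asks to have delivered.
METADATA_FIELDS = ("url", "domain", "year", "title")


def matches_query(doc, term="mhealth", language="da",
                  excluded_domains=("twitter.com",)):
    """Extract: documents in the given language that contain the term,
    excluding documents from the listed domains."""
    return (
        doc.language == language
        and term in doc.text.lower()
        and doc.domain not in excluded_domains
    )


def to_metadata_row(doc):
    """Transform: keep only the agreed metadata fields."""
    return {field: getattr(doc, field) for field in METADATA_FIELDS}


def run_etl(documents):
    """Return metadata rows that are ready to be written to CSV (load)."""
    return [to_metadata_row(d) for d in documents if matches_query(d)]


# Made-up sample records.
sample = [
    Document("https://example.dk/a", "example.dk", "da", 2015,
             "Sundhedsapps", "Om mHealth i Danmark"),
    Document("https://twitter.com/x/1", "twitter.com", "da", 2016,
             "Tweet", "mHealth er spændende"),
    Document("https://example.org/b", "example.org", "en", 2017,
             "Apps", "mHealth in Denmark"),
]
rows = run_etl(sample)
```

Only the first sample record survives the query: the second comes from the excluded domain and the third is not in Danish. Keeping the query and the field list as separate, named pieces mirrors how such a specification can be agreed on with the archive before anything is delivered.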
Data Delivery
• The mHealth project was one of the first projects to have
data delivered from the Danish web archive
• A data delivery contract making the project owner
responsible for the data once delivered
• This means that data can be studied in new ways but
also requires a data management mindset and IT skills
• Plus some infrastructure… but not much.
Data Management
• Data delivered by secure file sharing service hosted by
The Danish e-infrastructure Cooperation (DeiC)
• The project owner and all participants have to adhere to
data laws
• Data placed on central university server
• Data can only be accessed by the project team and only via
the university network (cable or VPN)
Data Management
GitHub: Code and Reports
University: Data
Project Team
Exploratory data analysis (EDA)
Researchers can
perform their own
interactive
exploration of
data by setting
variables such as
year.
Exploratory data analysis (EDA)
Project
deliverables can
be created
interactively and
exported to files.
No need for
special servers. A
variety of
analyses can be
made.
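Exporting a derived dataset to a file needs very little code. The following is a minimal Python sketch, assuming the analysis has produced page counts per year; the function name `export_counts`, the column names and the numbers are invented for illustration.

```python
import csv
import io


def export_counts(counts, out_file, header=("year", "pages")):
    """Write key/count pairs to out_file as CSV, sorted by key."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(header)
    for key in sorted(counts):
        writer.writerow([key, counts[key]])


# Made-up counts; in practice these would come from the analysis step.
pages_by_year = {2016: 1, 2014: 1, 2015: 3}

# An in-memory buffer stands in for a real file here; passing
# open("pages_by_year.csv", "w", newline="") would write to disk instead.
buffer = io.StringIO()
export_counts(pages_by_year, buffer)
exported = buffer.getvalue()
```

Because the function accepts any writable file object, the same code can write to a local file or to an in-memory buffer for a quick check.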
Exploratory data analysis (EDA)
• Notebook for interaction with the data
• Data is stored centrally but code is executed on
researcher’s laptop for instant results
• The researcher can easily explore the data
• Different kinds of analysis
• Different kinds of exports and visualisations
• Build a workflow, share the workflow
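A year-by-year exploration of the metadata can be sketched in a few lines. The Python example below is a minimal sketch on made-up data: the CSV columns `url`, `domain` and `year` and the example domains are assumptions, and real column names would follow from the ETL document. The project's own notebooks were written in R.

```python
import csv
import io
from collections import Counter

# Made-up metadata extract standing in for a delivered CSV file.
METADATA_CSV = """\
url,domain,year
https://a.example.dk/1,a.example.dk,2014
https://a.example.dk/2,a.example.dk,2015
https://a.example.dk/3,a.example.dk,2015
https://b.example.dk/1,b.example.dk,2015
https://c.example.dk/1,c.example.dk,2016
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dictionaries, one per page."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def pages_per_domain(rows, year=None):
    """Count pages per domain, optionally restricted to a single year."""
    selected = rows if year is None else [r for r in rows if int(r["year"]) == year]
    return Counter(r["domain"] for r in selected)


def pages_per_year(rows):
    """Count pages per year across the whole collection."""
    return Counter(int(r["year"]) for r in rows)


rows = load_rows(METADATA_CSV)

YEAR = 2015  # the variable a researcher would change and re-run
top_domains = pages_per_domain(rows, year=YEAR).most_common(3)
by_year = pages_per_year(rows)
```

Changing `YEAR` and re-running the last lines is the kind of interaction a notebook makes convenient: the data stays where it is, and only the question changes.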
Conclusion
The mHealth case shows:
• Data delivery from a national web archive brings
freedom of analysis because you are not restricted to a
certain research infrastructure
• Having data shifts concerns towards being a manager
not just a user
• Useful for teams because code and reports can be
shared (for instance on GitHub)
Conclusion
Please consult your web archive:
• Ask if they offer data (some have collections for
download)
• Ask if they offer IT services for a specific research task
• Consult your own institution about data management
• Remember to give credit – web archives are doing a
great service with often very limited resources!
Thank you!
To get in touch:
• Web: www.netlab.dk
• E-mail: ukh@cc.au.dk
• Twitter: @ulrich_netlab
• Facebook: Ulrich Have NetLab


Editor's Notes

  • #2: Hello everyone. Welcome to this short presentation: Hand me the data! What you should know as a humanities researcher before asking for data from a web archive
  • #3: Hello, my name is Ulrich. I am a research IT developer at NetLab. I work with both researchers and developers. On a daily basis I assist researchers from the humanities and social sciences in using web archive data in their research. I have a humanities background from Aarhus University but also spent some time in computer science in a program called Software Construction. My own interest in the web dates back to 1996. It was a glorious time when we used Netscape Navigator and a few of us were using e-mail.
  • #4: These are some of the things I will cover: What is web archive data? How do you get it and how do you handle it? And do you need to be a computer scientist? I have to note that this account is about a Danish research project using the Danish web archive. The Danish web archive is a national archive run by the Royal Danish Library under the legal deposit law. It is only open to researchers because of laws governing personal data and copyright. The access points are a Wayback Machine URL search and a full-text search offering a few basic search options. Not all web archives are the same; they offer different user interfaces and different possibilities when it comes to data analysis.
  • #5: We normally browse web archive data. This data is just like any other web content. It is served by a server and rendered in your browser. Chances are that your national web archive has created some collections or aggregate data. They may also have created an index of all the content. This means that a parser has processed the information and put it into a structured form. The information can now be searched in various ways and particular types of data can be explored.
  • #6: The mHealth project is an interesting case because it has a mixed team of health and communication researchers. The project contacted NetLab for help in finding a way to explore “mHealth” in 10 years of web data. mHealth is a term used for mobile health technologies. mHealth is part of the larger domain of eHealth. It has no authoritative definition and is constantly evolving as new technologies and practices emerge.
  • #7: When the team approached us they had already been granted access to the national web archive and had tried to do some research. The usual browsing of the web archive quickly turned out to be very hard because of the time frame of 10 years. There were simply too many pages. Furthermore, it appeared that there was a variety of uses of the term mHealth by different actors in different contexts. So it made sense to ask for the complete data in order to study it using computational methods.
  • #8: So we tried asking for the data...
  • #9: Exactly what do we mean by data? Clearly when you are using a web archive you are interacting with data. But you don’t have a clue about how data is stored in the archive or if any metadata is available. So your first step would be to understand your web archive in more detail. In the mHealth project we first looked at which types of data existed in the archive. Luckily all content has been indexed, which means that there is a big pool of data. It is usually hidden from the user but, as it turns out, it can be valuable for doing computational analysis. The mHealth team decided to ask for these data types: links, text and metadata. The team also asked for the information in CSV files instead of WARC files. This choice was a good one because we would get medium-sized files that could be quickly delivered.
  • #10: This work would lead to a specifications document called the ETL document. The ETL document clearly specifies queries and data fields. The queries can be used to define a certain collection or corpus, for example “all documents in Danish containing the word mHealth excluding documents from Twitter.” The data fields define data sets for metadata, text and links. The links and text derive from the web pages. The metadata can be information about the page such as date, content type, title and so on. As you will see, having a data set with metadata is very useful for exploring a collection.
  • #11: The mHealth project was one of the first projects to have data delivered from the Danish web archive. Through agreement the web archive would deliver data to the project owner who would then become responsible for the data. One of the first steps was to make sure that data could be stored and shared safely. The team would have to develop a data management mindset and some IT skills. Plus some infrastructure was needed, but not much.
  • #12: To sum up, having data from a web archive comes with some extra requirements, such as IT infrastructure, software and legal issues.
  • #13: Data was delivered via a secure file sharing service hosted by the Danish e-infrastructure Cooperation (DeiC), a national body that coordinates Danish digital infrastructure as an umbrella for the eight Danish universities. It is possible to send many gigabytes of data per delivery. The project owner and all participants have to comply with data laws. This means that data must be stored and transferred in a secure way. The mHealth project chose to use a university server which was only accessible to the team and only via the university network.
  • #14: So this is how the project work was organised: data was stored on the university server and only accessible to the project team. In order to comply with data laws, all access was logged. The code and reports were stored in a private repository on GitHub. Team members could then share work documents in a consistent workflow.
  • #15: The main focus was on exploratory data analysis. The central tool for the mHealth team was the notebook. A notebook is a document with both text and code. Various programming languages support the notebook method; in the case of mHealth we used the statistical programming language R to produce the notebooks. One advantage of the notebook is that it allows researchers to use a shared notebook to perform their own interactive exploration of data by setting variables and executing code. For instance, the mHealth notebook would allow for a year-by-year analysis of the data. The first analysis would use the metadata. Many insights into mHealth could be gained by simply looking at the dataset from different angles: which domains had most pages? Which month or year stood out?
  • #16: Another advantage was that the team could produce their own project deliverables directly in the notebook. No need for special software or servers.
  • #17: So no, you don’t need to be a computer scientist! As we learned from the mHealth project, exploratory data analysis is possible for all team members with very little effort in terms of IT skills and data management. Using the right setup and a data management mindset, the infrastructure for doing computational analysis is available to any small to mid-size organisation. Modern computer technology, networks and security technology make it possible to safely work on web archive data. The researcher can explore data, perform different kinds of analysis, and create visualizations and derived datasets. Most importantly, the research team can work better as a team by building a shared workflow.
  • #18: In conclusion, the mHealth case shows that data delivery from a national web archive brings freedom of analysis because you are not restricted to a certain research infrastructure. Having data shifts concerns towards being a manager, not just a user. This requires data management skills, something that is of value to researchers not just in working with web archives. Having a shared workflow is useful for teams. Just like in software development, code and documents can be stored in a repository on the internet or in the organisation.
  • #19: Finally I hope to have inspired some of you to learn more about your national web archive. Ask them if they offer data. Some, like the UK Web Archive, have datasets that you can download. Maybe your web archive has created aggregate datasets or even an index. Ask if they offer IT services for a specific research task. They often know the possibilities and limitations of their archive. Use your own institution to learn and educate yourself in IT and data management. And don’t forget to give credit - web archives are doing a great service with often very limited resources!
  • #20: Thank you!