The Initiative for Open Citations
and the OpenCitations Corpus
David Shotton
david.shotton@opencitations.net
Oxford e-Research Centre
University of Oxford, UK
9th Conference on
Open Access
Scholarly Publishing
Lisbon, Portugal
20 Sept 2017
© David Shotton 2017 Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence
2013 “Free scholarly citation data!”
Donatello’s
John the Baptist
Fifth Conference on
Open Access
Scholarly Publishing
Riga, Latvia
20 September 2013
. . . the voice of one
crying in the wilderness
2016 “Release open citation data!”
Eighth Conference on
Open Access
Scholarly Publishing
Virginia, USA
20 September 2016
Dario Taraborelli
Head of Research,
Wikimedia Foundation
2017 The year of success - citation data is freed!
n  Two fantastic success stories
§  The Initiative for Open Citations https://i4oc.org/
§  The OpenCitations Corpus http://opencitations.net
n  While related, these initiatives are separate and distinct
n  Two Italian heroes: Dario Taraborelli and Silvio Peroni
Crossref - providing the fundamental infrastructure
https://www.crossref.org/
n  Crossref is the registration agency for the Digital Object Identifiers (DOIs) of
scholarly publications (journal articles). Most publishers are members
n  Crossref holds metadata about articles, made available via its REST API
https://www.crossref.org/services/metadata-delivery/rest-api/
n  Crossref has its own heroes:
Ed Pentz (Executive Director) and Geoff Bilder (Director of Strategic Initiatives)
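As a rough illustration of the REST API mentioned above, the following Python sketch builds the request URL for one DOI and pulls the cited DOIs out of a response. The `https://api.crossref.org/works/{DOI}` route and the `message` / `reference` response fields belong to Crossref's public API; the helper names and the sample response are invented for illustration, and a `reference` list is only returned when the publisher has deposited its references and made them open.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def works_url(doi):
    """Build the REST API URL for a single DOI (the DOI is percent-encoded)."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="")


def cited_dois(response):
    """Return the DOIs found in the reference list of a /works/{doi} response."""
    references = response.get("message", {}).get("reference", [])
    return [ref["DOI"] for ref in references if "DOI" in ref]


def fetch_work(doi):
    """Fetch the metadata record for one DOI (performs a network call)."""
    with urllib.request.urlopen(works_url(doi)) as resp:
        return json.load(resp)


# Hand-written sample in the shape of an API response, not real output.
sample = {
    "message": {
        "DOI": "10.1007/s11892-016-0752-4",
        "reference": [
            {"key": "ref1", "DOI": "10.1093/intimm/dxp039"},
            {"key": "ref2", "unstructured": "A reference without a DOI"},
        ],
    }
}

print(works_url("10.1007/s11892-016-0752-4"))
print(cited_dois(sample))
```

`fetch_work` is deliberately not called here, so the sketch runs without network access; references that carry no DOI are simply skipped.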
The Initiative for Open Citations
n  The Initiative for Open Citations is a collaboration between scholarly publishers,
researchers, and other interested parties to promote the unrestricted availability
of scholarly citation data. It does not host citation data!
n  Launched April 6, 2017. Web site: https://i4oc.org
n  Spearheaded by Dario Taraborelli of the Wikimedia Foundation
§  with help from Jonathan Dugan, Martin Fenner, Jan Gerlach,
Catriona MacCallum, Daniel Mietchen, Cameron Neylon,
Mark Patterson, Michelle Paulson, Silvio Peroni and myself
n  Six founding organizations:
§  The Wikimedia Foundation, PLOS, eLife, DataCite, OpenCitations,
and the Centre for Culture and Technology at Curtin University
n  Within a short space of time, I4OC has persuaded most of the major scholarly
publishers to make their reference lists open, so that the proportion of all
references submitted to Crossref that are now open has risen from 1% to
over 45%!
Publishers supporting I4OC and opening their references
n  49 scholarly publishers have opened their references, including the following
major ones:
n  Commercial publishers
§  Association for Computing Machinery, BMJ, De Gruyter, eLife, EMBO
Press, Hindawi, IOS Press, PeerJ, Pensoft Publishers, Portland Press,
Public Library of Science, Springer Nature, Taylor & Francis, Wiley
n  University and scholarly presses
§  Cambridge University Press, Cold Spring Harbor Laboratory Press,
Company of Biologists, Edinburgh University Press, MIT Press,
Rockefeller University Press
n  Learned societies
§  American Association for the Advancement of Science (AAAS),
American Physical Society, American Society for Cell Biology,
International Union of Crystallography, Proceedings of the
National Academy of Sciences (PNAS), Royal Society of Chemistry,
The Royal Society
Organizations and institutions who have endorsed I4OC
n  Funders
§  Sloan Foundation, Bill and Melinda Gates Foundation, Jisc, Simons
Foundations Science Sandbox, Wellcome Trust
n  Research organizations
§  Allen Institute for Artificial Intelligence, Microsoft Research
n  Libraries
§  Association of Research Libraries, British Library, California Digital
Library, Harvard Library Office for Scholarly Communication, LIBER,
Max Planck Digital Library
n  Bibliographic / bibliometric organizations
§  Altmetrics, CiteSeerX, DBLP Computer Science Bibliography,
ImpactStory, Zotero
n  Other organizations
§  Dryad Data Repository, Figshare, Internet Archive, Mozilla, OASPA,
Open Knowledge International, OpenAIRE, ScienceOpen, Wiki Education
Foundation, Wikimedia Deutschland, Wikimedia UK
I4OC – what’s left to do
n  Almost 50% of Crossref-deposited references, from ~16 million articles, are
now open, leaving about half that are still closed
n  Crossref has over 7000 members, and it’s the long tail of smaller
publisher-members that are not presently opening their references
n  This includes a large number of Open Access publishers!
§  Just because an article is published as Open Access and its references
are available on the publisher’s web site, this is not sufficient for the bulk
harvesting and analysis of citation data
§  Imagine the effort of going to each site in turn and scraping reference lists
presented in a wide variety of differing formats and DTD markups!
n  Many small scholarly publishers are not even members of Crossref
n  But help is at hand:
§  OASPA has a sponsored agreement with Crossref whereby its smaller
members can join Crossref via OASPA, with OASPA covering the cost of
a proportion of their DOIs
How to open references using the Crossref Cited-by service
n  The Crossref Cited-by service is a free service that helps publishers find out who
is citing their articles
n  Publishers submit article reference lists to Crossref along with other metadata
n  However, the Crossref default is that these reference lists are closed, not OPEN!
n  To open their article reference lists, a publisher needs to do one of two things:
§  Either contact support@crossref.org and ask them to turn on reference
distribution for all the DOI prefixes they manage
§  Or, in the article metadata they submit to Crossref, set the
reference_distribution_opt attribute to “any” for each DOI deposit
where they want to make references openly available
n  It’s that easy!!!
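The second option can be sketched in Python as follows. This is a minimal illustration, not a complete deposit: it assumes that `reference_distribution_opt` is carried as an attribute on the content item element (here `journal_article`), and the XML fragment is a simplified, hypothetical one that omits the namespaces and most of the elements a real Crossref deposit requires.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment: a real deposit has many more elements
# and XML namespaces, which are left out here.
fragment = """
<journal_article publication_type="full_text">
  <titles><title>An example article</title></titles>
</journal_article>
""".strip()


def open_references(article_xml):
    """Return the fragment with reference distribution set to "any"."""
    article = ET.fromstring(article_xml)
    article.set("reference_distribution_opt", "any")
    return ET.tostring(article, encoding="unicode")


opened = open_references(fragment)
print(opened)
```

A publisher's deposit pipeline would apply the same change to every content item whose references should be open before sending the batch to Crossref.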
ZooKeys use of Crossref open citation data
The OpenCitations Corpus
n  OpenCitations (http://opencitations.net) is a small infrastructure organization
directed by myself and Silvio Peroni
n  Its primary purpose is to host and develop the OpenCitations Corpus (OCC),
a Linked Open Data repository of scholarly bibliographic citation data
n  A founding member of I4OC, it is distinct and separate from that initiative
n  The first OCC prototype was created at Oxford in 2011 with Jisc funding – see
my 2013 COASP talk in Riga (http://zeeba.tv/the-open-citations-corpus/)
n  A new instance of the OCC, based on our revised metadata schema, was
created by Silvio Peroni and is now running at the University of Bologna
n  It has been ingesting scholarly references continuously since early July 2016
n  OCC now provides the largest RDF collection of open citation data on the Web
§  Currently holds references from ~240,000 citing bibliographic resources
§  Provides >10 million citation links to over 5.5 million cited resources
§  These data are freely available under a CC0 public domain waiver
Source data - reference lists from PubMed Central
n  At present, the ingested reference lists are obtained by processing the XML
sources of papers in the Open Access subset of PubMed Central
n  These are parsed to yield authors, titles, journal names, etc.
§  We ask for the most recent papers first
§  Thus, as citing papers, the OCC mainly includes articles published in
2016 and 2017
n  The identifiers of all the citing papers already processed are stored locally, so
as not to request the same XML source twice
n  We then call several external APIs, including Crossref and ORCID, to obtain
additional metadata describing the citing and cited papers and their authors
n  There are almost 1.7 million OA articles available in PubMed Central
§  So far we have harvested 14% . . .
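The idea of remembering which citing papers have already been processed can be sketched as below. This is an illustrative Python outline only, not the OpenCitations ingestion code: the class, the function names and the JSON file used as the local store are all assumptions, and the download step is a placeholder.

```python
import json
import tempfile
from pathlib import Path


class ProcessedIds:
    """Keeps the identifiers of already-processed citing papers on disk."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.ids = set(json.loads(self.path.read_text()))
        else:
            self.ids = set()

    def is_new(self, pmcid):
        return pmcid not in self.ids

    def mark_done(self, pmcid):
        self.ids.add(pmcid)
        self.path.write_text(json.dumps(sorted(self.ids)))


def harvest(candidates, store, fetch_xml):
    """Request the XML source only for papers that have not been seen before."""
    fetched = []
    for pmcid in candidates:
        if store.is_new(pmcid):
            fetch_xml(pmcid)  # placeholder for the real download and parsing
            store.mark_done(pmcid)
            fetched.append(pmcid)
    return fetched


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "processed.json"
    ids = ["PMC4863913", "PMC2686615"]
    first_run = harvest(ids, ProcessedIds(path), lambda pmcid: None)
    # A second run with a fresh store object re-reads the file and skips known ids.
    second_run = harvest(ids, ProcessedIds(path), lambda pmcid: None)

print(first_run, second_run)
```

Writing the file after every paper is slow but keeps the record intact if a long-running harvest is interrupted.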
The raw reference list data
n  The reference lists extracted from citing papers are made available in JSON:
{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "localid": "MED-27168063",
  "curator": "BEE EuropeanPubMedCentralProcessor",
  "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4863913/fullTextXML",
  "source_provider": "Europe PubMed Central",
  "references": [
    ...
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome using a computational program and its relationship to autoreactive T cells, Int Immunol, 2009, 21, 6, 705, 13, DOI: 10.1093/intimm/dxp039, PMID: 19461125",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    },
    ...
  ]
}
The citing paper's metadata and identifiers
A reference in the citing paper's reference list, with its own ids
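To show how such a record turns into citation links, here is a small Python sketch that reads a trimmed copy of the record above and pairs the citing DOI with each cited DOI. The function name is made up for this example, the bibentry is shortened, and the second reference (one without identifiers) is a hypothetical addition to show that such entries are skipped.

```python
import json

# Trimmed copy of the record shown above; the second reference is hypothetical.
record_json = """
{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "references": [
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome ...",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    },
    {
      "bibentry": "A reference with no identifiers",
      "process_entry": "True"
    }
  ]
}
"""


def citation_links(record):
    """Yield (citing DOI, cited DOI) pairs for references that carry a DOI."""
    citing = record["doi"]
    for ref in record.get("references", []):
        if ref.get("doi"):
            yield citing, ref["doi"]


links = list(citation_links(json.loads(record_json)))
print(links)
```

References without a DOI still hold a usable `bibentry` string, which is why additional lookups against external services are needed to identify them.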
The SPAR (Semantic Publishing and Referencing) Ontologies
n  OCC data are then stored in RDF (JSON-LD) using the SPAR (Semantic
Publishing and Referencing) ontologies and other standard vocabularies
n  These SPAR ontologies include:
§  FaBiO, the FRBR-aligned Bibliographic Ontology - an ontology for
describing bibliographic entities (books, articles, etc.)
§  CiTO, the Citation Typing Ontology - enables the characterization of
citations, both factually and rhetorically
§  BiRO, the Bibliographic Reference Ontology - an ontology to define
bibliographic records and references, and their compilation into
bibliographic collections and reference lists, respectively
http://www.sparontologies.net/
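As a small illustration of what a citation looks like when expressed with one of these ontologies, the Python sketch below builds a JSON-LD document that links a citing article to a cited one with CiTO's `cito:cites` property. It is not the OCC data model: the real corpus uses its own resource URIs and a richer set of properties, whereas this example identifies both articles by their DOI URLs purely for simplicity, and the function name is invented.

```python
import json

CITO = "http://purl.org/spar/cito/"


def citation_jsonld(citing_doi, cited_doi):
    """Describe one citation as a JSON-LD document using cito:cites."""
    return {
        "@context": {"cito": CITO},
        "@id": "https://doi.org/" + citing_doi,
        "cito:cites": {"@id": "https://doi.org/" + cited_doi},
    }


doc = citation_jsonld("10.1007/s11892-016-0752-4", "10.1093/intimm/dxp039")
print(json.dumps(doc, indent=2))
```

Any JSON-LD processor can expand this into a single RDF triple whose predicate is `http://purl.org/spar/cito/cites`.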
Availability of the OpenCitations Corpus data
n  All the OpenCitations software is available on GitHub under an open license
n  The data in the OpenCitations Corpus are available in three different ways:
§  Direct access to bibliographic resources by means of their HTTP URIs
(via content negotiation), e.g. https://w3id.org/oc/corpus/br/1
§  Queries to our SPARQL endpoint: https://w3id.org/oc/sparql
§  Monthly dumps stored in Figshare: http://opencitations.net/download
n  Currently the OCC uses a good graph-based triplestore – Blazegraph
n  However, the virtual machine that hosts it is very limited in resources,
causing performance problems for demanding SPARQL queries
n  We plan soon to commission a new powerful physical server that should
provide a better user experience, and to develop additional user-friendly
interfaces for accessing the OCC data, including graphic visualizations of
citation networks
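The first two access routes can be sketched with Python's standard library. The example only builds the HTTP requests and does not send them; the addresses are the ones given on this slide and may have changed since the talk. The choice of `application/ld+json` as the requested format, the assumption that citation links are expressed with `cito:cites`, and the helper names are illustrative guesses rather than documented behaviour of the service.

```python
import urllib.parse
import urllib.request

RESOURCE_URI = "https://w3id.org/oc/corpus/br/1"
SPARQL_ENDPOINT = "https://w3id.org/oc/sparql"

# Assumes citations are stored as cito:cites links between resources.
QUERY = """
PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?citing ?cited
WHERE { ?citing cito:cites ?cited }
LIMIT 10
"""


def resource_request(uri, media_type="application/ld+json"):
    """Ask for an RDF serialisation of a resource via the Accept header."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def sparql_request(endpoint, query):
    """Build a SPARQL protocol GET request with the query in the URL."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


rdf_req = resource_request(RESOURCE_URI)
sparql_req = sparql_request(SPARQL_ENDPOINT, QUERY)

# urllib.request.urlopen(rdf_req) would actually send the request.
print(rdf_req.full_url, rdf_req.get_header("Accept"))
print(sparql_req.full_url)
```

The `LIMIT` clause keeps the query cheap, which matters given the performance limits of the current virtual machine.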
Use of the OpenCitations web site
n  Accesses to the OpenCitations web site and services:
The “corpus” and “sparql” pages have together gained 89% of the total accesses, showing that
people mainly access the OpenCitations Corpus to explore and use the data within it
Use of OpenCitations data stored on Figshare
What happened this summer?
n  Use of the OpenCitations social accounts
§  Twitter - https://twitter.com/opencitations
§  Wordpress Blog – https://opencitations.wordpress.com/
increased markedly following the launch of the Initiative for Open Citations
Who is using OpenCitations, and for what?
n  Organizations and projects that we know use OpenCitations resources include:
§  Wikidata - pulling citation data to enrich their pages
§  OpenAIRE – using OCC bibliographic resources info in OpenAIRE
§  LOC-DB - have adopted the OpenCitations data model for their database
§  Tomas Petricek of the Turing Institute - extending his Gamma Project
visualization software to handle OpenCitations’ RDF data
§  Ontotext.com - combining Springer's SciGraph data with OpenCitations
data using SPARQL federation
§  Anna Kamińska of the Polish Librarians Association - undertaking citation
network analysis of PLoS One research papers using data in the OCC
n  We can’t know who else is using OpenCitations resources unless they tell us!
§  Please let us know if you are!
n  On 10th September, Crossref blogged about our use of their REST API
§  https://www.crossref.org/blog/using-the-crossref-rest-api.-part-5-with-
opencitations/
Present status of OpenCitations
n  We have recently received a small
grant from the Sloan Foundation for the
OpenCitations Enhancement Project
§  This provides one year’s salary
for a postdoc to develop new user
interfaces, and new hardware to
enhance the OCC performance
n  We have just appointed Ivan Heibi to
work on the OCC with Silvio in Bologna
n  Silvio and Ivan will be commissioning
the new hardware next month
§  This will use parallel processing
to increase ingest rate 30-fold
n  We are in the process of appointing an
International Advisory Board to guide
the growth of OpenCitations
Enhancing the OpenCitations ingestion rate
n  OpenCitations currently ingests ~8 million new citations per year
n  With 30 Raspberry Pis working in parallel as ingest machines, we anticipate
that this rate will increase to ~240 million new citations per year
n  By the end of 2018, OpenCitations should hold ~ 250 million citations,
compared to Web of Knowledge’s ~1.25 billion
n  Even this partial coverage will include citations of all important papers,
these critical papers being easily recognized because they are highly cited,
forming nodes in the citation graph with a large number of inward citation links
n  A further five-fold increase in ingest rate - significant but achievable with
additional hardware (and funding!) - will enable us to reach parity by 2020
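The arithmetic behind these projections can be checked with a few lines of Python. The figures are the ones quoted above; the calculation assumes that the ingest rate scales linearly with the number of machines and that the Web of Knowledge total stays fixed, both of which are simplifications.

```python
current_rate = 8_000_000        # citations ingested per year today
machines = 30                   # Raspberry Pis working in parallel
wok_citations = 1_250_000_000   # approximate Web of Knowledge holdings

# Linear scaling assumption: 30 machines give 30 times the ingest rate.
parallel_rate = current_rate * machines

# A further five-fold increase on top of the parallel rate.
boosted_rate = parallel_rate * 5

# Years needed at the boosted rate to ingest as many citations as WoK holds.
years_to_match = wok_citations / boosted_rate

print(f"{parallel_rate:,} citations per year with {machines} machines")
print(f"{boosted_rate:,} citations per year after a further five-fold increase")
print(f"about {years_to_match:.2f} years of ingest at that rate to match WoK")
```

At the boosted rate of 1.2 billion citations per year, roughly one year of ingest would cover a collection of Web of Knowledge's size, which is consistent with the stated goal of parity by 2020.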
Where will the references come from?
n  With the enhanced ingest rate, we will quickly consume all 1.7 million articles
in the Open Access Subset of PubMed Central
n  We will then start harvesting the references from the ~16 million articles
already made open at Crossref in response to the Initiative for Open Citations,
and the additional articles that I4OC now encourages other publishers to open
n  Possible additional significant sources of open citation data include
§  ArXiv (1.3 million preprints)
§  CiteSeerX (>120 million references from >6 million documents)
§  CitEc (11 million references from a million Economics papers)
n  References from pre-digital publications extracted by text mining, e.g.
§  In the Social Sciences, from the LOC-DB at the University of Mannheim
§  In Biological Taxonomy, mined into BioStor by Rod Page from the
Biodiversity Heritage Library, e.g. http://biostor.org/reference/105357
We are winning the battle for open scholarship!
David Shotton
david.shotton@opencitations.net
Silvio Peroni
silvio.peroni@opencitations.net
Website: http://opencitations.net
Email: contact@opencitations.net
Twitter: @opencitations
Blog: https://opencitations.wordpress.com
Website: https://i4oc.org/
Email: info@i4oc.org
Twitter: @i4oc_org
Dario Taraborelli
dtaraborelli@wikimedia.org
Mark Patterson
m.patterson@elifesciences.org
Catriona MacCallum
catriona.maccallum@hindawi.com

More Related Content

PDF
OpenCitations
University of Bologna
 
PDF
A document-inspired way for tracking changes of RDF data - The case of the Op...
University of Bologna
 
PDF
Freedom for bibliographic references: OpenCitations arise
University of Bologna
 
PPTX
When the Web of Linked Data Arrives
Richard Wallis
 
PPT
The SFX Framework for Context-Sensitive Reference Linking
Herbert Van de Sompel
 
PPTX
Data Designed for Discovery
OCLC
 
PPT
Linked Open Data for Libraries
Lukas Koster
 
PPTX
鏈結資料在圖書館的應用20131107
皓仁 柯
 
OpenCitations
University of Bologna
 
A document-inspired way for tracking changes of RDF data - The case of the Op...
University of Bologna
 
Freedom for bibliographic references: OpenCitations arise
University of Bologna
 
When the Web of Linked Data Arrives
Richard Wallis
 
The SFX Framework for Context-Sensitive Reference Linking
Herbert Van de Sompel
 
Data Designed for Discovery
OCLC
 
Linked Open Data for Libraries
Lukas Koster
 
鏈結資料在圖書館的應用20131107
皓仁 柯
 

What's hot (20)

PPTX
Reminiscing about interoperability
Herbert Van de Sompel
 
PPTX
How Libraries Use Publisher Metadata Redux (Steven Shadle)
Charleston Conference
 
PPTX
Multilingual presentation ifla 2013 08-19
Janifer Gatenby
 
PPT
China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)
Crossref
 
PPTX
Signposting for Repositories
Martin Klein
 
PDF
Towards a Machine-Actionable Scholarly Communication System
Herbert Van de Sompel
 
PPTX
How Libraries Use Publisher Metadata - Crossref Community Webinar
Crossref
 
PPT
towards interoperable archives: the Universal Preprint Service initiative
Herbert Van de Sompel
 
PPTX
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
Alison Hitchens
 
PDF
Linked open data and libraries
Alison Hitchens
 
PPTX
What is #LODLAM?! (revised January 2015)
Alison Hitchens
 
PPTX
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
Herbert Van de Sompel
 
PPTX
Hiberlink: Investigating Reference Rot, December 2013
Herbert Van de Sompel
 
PPTX
Environmental trends and OCLC Research, a presentation at the University of N...
lisld
 
PDF
Verifiable, linked open knowledge that anyone can edit
Dario Taraborelli
 
PDF
A Clean Slate?
Herbert Van de Sompel
 
PPT
Open Annotation Collaboration Introduction
Timothy Cole
 
PPTX
Semantics as a service at EMBL-EBI
Simon Jupp
 
PPTX
The library in the life of the user
lisld
 
Reminiscing about interoperability
Herbert Van de Sompel
 
How Libraries Use Publisher Metadata Redux (Steven Shadle)
Charleston Conference
 
Multilingual presentation ifla 2013 08-19
Janifer Gatenby
 
China: Journal Publishing, DOI and CrossCheck (2011 CrossRef Workshops)
Crossref
 
Signposting for Repositories
Martin Klein
 
Towards a Machine-Actionable Scholarly Communication System
Herbert Van de Sompel
 
How Libraries Use Publisher Metadata - Crossref Community Webinar
Crossref
 
towards interoperable archives: the Universal Preprint Service initiative
Herbert Van de Sompel
 
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
Alison Hitchens
 
Linked open data and libraries
Alison Hitchens
 
What is #LODLAM?! (revised January 2015)
Alison Hitchens
 
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
Herbert Van de Sompel
 
Hiberlink: Investigating Reference Rot, December 2013
Herbert Van de Sompel
 
Environmental trends and OCLC Research, a presentation at the University of N...
lisld
 
Verifiable, linked open knowledge that anyone can edit
Dario Taraborelli
 
A Clean Slate?
Herbert Van de Sompel
 
Open Annotation Collaboration Introduction
Timothy Cole
 
Semantics as a service at EMBL-EBI
Simon Jupp
 
The library in the life of the user
lisld
 
Ad

Similar to The Initiative for Open Citations and the OpenCitations Corpus (20)

PPTX
David Shotton - OpenCon Oxford, 1st Dec 2017
Crossref
 
PPTX
finde datasets repository.pptx
hasanrdhaiwi
 
PPTX
UKSG 2018 Breakout - Setting your cites to open I4OC - Maccallum
UKSG: connecting the knowledge community
 
PPTX
Open Access: an introduction
Elizabeth Yates
 
PPTX
Open data sources in VOSviewer
Nees Jan van Eck
 
PPTX
The university library as a support for the institutional research identity
Infobiblio_es Información Bibliográfica
 
PPTX
2013 CrossRef Annual Meeting, How CrossRef has Accelerated Science and Its Pr...
Crossref
 
PPTX
Possible ways of getting oneself abreast of current literature
Mythili Srinivasan
 
PPTX
Visualizing science based on open data sources
Nees Jan van Eck
 
PPTX
University at Albany Lunch and Learn
rachelmccullough
 
PPTX
Finding Insights in Article-Level Metrics for Research Evaluation
Richard Cave
 
PPT
PLoS - Why It is a Model to be Emulated
Philip Bourne
 
PPTX
Syracuse Lunch and Learn
rachelmccullough
 
PDF
Crossref/OASPA Publishers
Crossref
 
PPTX
Postgraduate orientation 6th june 2017
Debs Martindale
 
PDF
Open Bibliography, Citations and Scholarship
benosteen
 
PPTX
A Strategy for Sharing Your Research: Make Your Work Open Access
Sunghae Ress
 
PPTX
Web Today, Good Tomorrow? Transactional archiving of web content
Peter Burnhill
 
PDF
The role of open access with regards to bibliometrics in the merit and resour...
Gustaf Nelhans
 
PDF
Open Access + Preprints for Scholars and Journals
Scholastica
 
David Shotton - OpenCon Oxford, 1st Dec 2017
Crossref
 
finde datasets repository.pptx
hasanrdhaiwi
 
UKSG 2018 Breakout - Setting your cites to open I4OC - Maccallum
UKSG: connecting the knowledge community
 
Open Access: an introduction
Elizabeth Yates
 
Open data sources in VOSviewer
Nees Jan van Eck
 
The university library as a support for the institutional research identity
Infobiblio_es Información Bibliográfica
 
2013 CrossRef Annual Meeting, How CrossRef has Accelerated Science and Its Pr...
Crossref
 
Possible ways of getting oneself abreast of current literature
Mythili Srinivasan
 
Visualizing science based on open data sources
Nees Jan van Eck
 
University at Albany Lunch and Learn
rachelmccullough
 
Finding Insights in Article-Level Metrics for Research Evaluation
Richard Cave
 
PLoS - Why It is a Model to be Emulated
Philip Bourne
 
Syracuse Lunch and Learn
rachelmccullough
 
Crossref/OASPA Publishers
Crossref
 
Postgraduate orientation 6th june 2017
Debs Martindale
 
Open Bibliography, Citations and Scholarship
benosteen
 
A Strategy for Sharing Your Research: Make Your Work Open Access
Sunghae Ress
 
Web Today, Good Tomorrow? Transactional archiving of web content
Peter Burnhill
 
The role of open access with regards to bibliometrics in the merit and resour...
Gustaf Nelhans
 
Open Access + Preprints for Scholars and Journals
Scholastica
 
Ad

More from University of Bologna (14)

PDF
A Simplified Agile Methodology for Ontology Development
University of Bologna
 
PDF
FOOD: FOod in Open Data
University of Bologna
 
PDF
A pattern-based ontology for describing publishing workflows
University of Bologna
 
PDF
Semantic lenses to bring digital and semantic publishing together
University of Bologna
 
PDF
Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...
University of Bologna
 
PDF
Characterising citations in scholarly articles: an experiment
University of Bologna
 
PDF
Bringing semantic publishing into TEI: ideas and pointers
University of Bologna
 
PDF
Tracking Changes through EARMARK: a Theoretical Perspective and an Implementa...
University of Bologna
 
PDF
Towards the automatic identification of the nature of citations
University of Bologna
 
KEY
The Live OWL Documentation Environment: a tool for the automatic generation o...
University of Bologna
 
KEY
Scholarly publishing and Linked Data: describing roles, statuses, temporal an...
University of Bologna
 
PDF
Embedding semantic annotations within texts: the FRETTA approach
University of Bologna
 
PDF
Dealing with Markup Semantics
University of Bologna
 
PDF
Handling Markup Overlaps Using OWL
University of Bologna
 
A Simplified Agile Methodology for Ontology Development
University of Bologna
 
FOOD: FOod in Open Data
University of Bologna
 
A pattern-based ontology for describing publishing workflows
University of Bologna
 
Semantic lenses to bring digital and semantic publishing together
University of Bologna
 
Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...
University of Bologna
 
Characterising citations in scholarly articles: an experiment
University of Bologna
 
Bringing semantic publishing into TEI: ideas and pointers
University of Bologna
 
Tracking Changes through EARMARK: a Theoretical Perspective and an Implementa...
University of Bologna
 
Towards the automatic identification of the nature of citations
University of Bologna
 
The Live OWL Documentation Environment: a tool for the automatic generation o...
University of Bologna
 
Scholarly publishing and Linked Data: describing roles, statuses, temporal an...
University of Bologna
 
Embedding semantic annotations within texts: the FRETTA approach
University of Bologna
 
Dealing with Markup Semantics
University of Bologna
 
Handling Markup Overlaps Using OWL
University of Bologna
 

Recently uploaded (20)

PPTX
Pengenalan Sel dan organisasi kehidupanpptx
SuntiEkaprawesti1
 
PPTX
Internal Capsule_Divisions_fibres_lesions
muralinath2
 
PPTX
Unit 4 - Astronomy and Astrophysics - Milky Way And External Galaxies
RDhivya6
 
PDF
A deep Search for Ethylene Glycol and Glycolonitrile in the V883 Ori Protopla...
Sérgio Sacani
 
PPTX
The Obesity Paradox. Friend or Foe ?pptx
drdgd1972
 
PDF
study of microbiologically influenced corrosion of 2205 duplex stainless stee...
ahmadfreak180
 
PPTX
mirna_2025_clase_genética_cinvestav_Dralvarez
Cinvestav
 
PPTX
Brain_stem_Medulla oblongata_functions of pons_mid brain
muralinath2
 
PPTX
Role of GIS in precision farming.pptx
BikramjitDeuri
 
PPTX
Reticular formation_nuclei_afferent_efferent
muralinath2
 
PPTX
Home Garden as a Component of Agroforestry system : A survey-based Study
AkhangshaRoy
 
PPT
Grade_9_Science_Atomic_S_t_r_u_cture.ppt
QuintReynoldDoble
 
PDF
Control and coordination Class 10 Chapter 6
LataHolkar
 
PPTX
Q1_Science 8_Week4-Day 5.pptx science re
AizaRazonado
 
PDF
High-definition imaging of a filamentary connection between a close quasar pa...
Sérgio Sacani
 
PDF
Approximating manifold orbits by means of Machine Learning Techniques
Esther Barrabés Vera
 
DOCX
Echoes_of_Andromeda_Partial (1).docx9989
yakshitkrishnia5a3
 
PDF
The Cosmic Symphony: How Photons Shape the Universe and Our Place Within It
kutatomoshi
 
PPTX
Quality control test for plastic & metal.pptx
shrutipandit17
 
PDF
Sujay Rao Mandavilli Multi-barreled appraoch to educational reform FINAL FINA...
Sujay Rao Mandavilli
 
Pengenalan Sel dan organisasi kehidupanpptx
SuntiEkaprawesti1
 
Internal Capsule_Divisions_fibres_lesions
muralinath2
 
Unit 4 - Astronomy and Astrophysics - Milky Way And External Galaxies
RDhivya6
 
A deep Search for Ethylene Glycol and Glycolonitrile in the V883 Ori Protopla...
Sérgio Sacani
 
The Obesity Paradox. Friend or Foe ?pptx
drdgd1972
 
study of microbiologically influenced corrosion of 2205 duplex stainless stee...
ahmadfreak180
 
mirna_2025_clase_genética_cinvestav_Dralvarez
Cinvestav
 
Brain_stem_Medulla oblongata_functions of pons_mid brain
muralinath2
 
Role of GIS in precision farming.pptx
BikramjitDeuri
 
Reticular formation_nuclei_afferent_efferent
muralinath2
 
Home Garden as a Component of Agroforestry system : A survey-based Study
AkhangshaRoy
 
Grade_9_Science_Atomic_S_t_r_u_cture.ppt
QuintReynoldDoble
 
Control and coordination Class 10 Chapter 6
LataHolkar
 
Q1_Science 8_Week4-Day 5.pptx science re
AizaRazonado
 
High-definition imaging of a filamentary connection between a close quasar pa...
Sérgio Sacani
 
Approximating manifold orbits by means of Machine Learning Techniques
Esther Barrabés Vera
 
Echoes_of_Andromeda_Partial (1).docx9989
yakshitkrishnia5a3
 
The Cosmic Symphony: How Photons Shape the Universe and Our Place Within It
kutatomoshi
 
Quality control test for plastic & metal.pptx
shrutipandit17
 
Sujay Rao Mandavilli Multi-barreled appraoch to educational reform FINAL FINA...
Sujay Rao Mandavilli
 

The Initiative for Open Citations and the OpenCitations Corpus

  • 1. Oxford e-Research Centre University of Oxford, UK 9th Conference on Open Access Scholarly Publishing Lisbon, Portugal 20 Sept 2017 © David Shotton 2017 Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence [email protected] David Shotton The Initiative for Open Citations and the OpenCitations Corpus
  • 2. 2013 “Free scholarly citation data!” Donatello’s John the Baptist Fifth Conference on Open Access Scholarly Publishing Riga, Latvia 20 September 2013 . . . the voice of one crying in the wilderness
  • 3. 2016 “Release open citation data!” Eighth Conference on Open Access Scholarly Publishing Virginia, USA 20 September 2016 Dario Taraborelli Head of Research, Wikimedia Foundation
  • 4. 2017 The year of success - citation data is freed! n  Two fantastic success stories §  The Initiative for Open Citations https://i4oc.org/ §  The OpenCitations Corpus http://opencitations.net n  While related, these initiatives are separate and distinct n  Two Italian heros: Dario Taraborelli and Silvio Peroni
  • 5. Crossref - providing the fundamental infrastructure https://www.crossref.org/ n  Crossref is the registration agency of Digital Object Identifiers (DOIs) for scholarly publications (journal articles). Most publishers are members n  Crossref hold metadata about articles, made available via its REST API https://www.crossref.org/services/metadata-delivery/rest-api/ n  Crossref has its own heros: Ed Pentz Executive Director Geoff Bilder Director of Strategic Initiatives
  • 6. The Initiative for Open Citations n  The Initiative for Open Citations is a collaboration between scholarly publishers, researchers, and other interested parties to promote the unrestricted availability of scholarly citation It does not host citation data! n  Launched April 6, 2017 Web site https://i4oc.org n  Spearheaded by Dario Taraborelli of the Wikimedia Foundation §  with help from Jonathan Dugan, Martin Fenner, Jan Gerlach, Catriona MacCallum, Daniel Mietchen, Cameron Neylon, Mark Patterson, Michelle Paulson, Silvio Peroni and myself n  Six founding organizations: §  The Wikimedia Foundation, PLOS, eLife, DataCite, OpenCitations, and the Centre for Culture and Technology at Curtin University n  Within a short space of time, I4OC has persuaded most of the major scholarly publishers to make their reference lists open, so that the proportion of all references submitted to Crossref that are now open has risen from 1% to over 45%!
  • 7. Publishers supporting I4OC and opening their references n  49 scholarly publishers have opened their references, including the following major ones: n  Commercial publishers §  Association for Computing Machinery, BMJ, De Gruyter, eLife, EMBO Press, Hindawi, IOS Press, PeerJ, Pensoft Publishers, Portland Press, Public Library of Science, Springer Nature, Taylor & Francis, Wiley n  University and scholarly presses §  Cambridge University Press, Cold Spring Harbor Laboratory Press, Company of Biologists, Edinburgh University Press, MIT Press, Rockefeller University Press n  Learned societies §  American Association for the Advancement of Science (AAAS), American Physical Society, American Society for Cell Biology, International Union of Crystallography, Proceedings of the National Academy of Sciences (PNAS), Royal Society of Chemistry, The Royal Society
  • 8. Organizations and institutions who have endorsed I4OC n  Funders §  Sloan Foundation, Bill and Melinda Gates Foundation, Jisc, Simons Foundations Science Sandbox, Wellcome Trust n  Research organizations §  Allen Institute for Artificial Intelligence, Microsoft Research n  Libraries §  Association of Research Libraries, British Library, California Digital Library, Harvard Library Office for Scholarly Communication, LIBER, Max Planck Digital Library n  Bibliographic / bibliometric organizations §  Altmetrics, CiteSeerX, DBLP Computer Science Bibliography, ImpactStory, Zotero n  Other organizations §  Dryad Data Repository, Figshare, Internet Archive, Mozilla, OASPA, Open Knowledge International, OpenAire, ScienceOPEN, Wiki Education Foundation, Wikimedia Deutchland, Wikimedia UK
  • 9. I4OC – what’s left to do n  Almost 50% of Crossref-deposited references, from ~16 million articles, are now open, leaving about half that are still closed n  Crossref has over 7000 members, and it’s the long tail of smaller publisher-members that are not presently opening their references n  This includes a large number of Open Access publishers! §  Just because an article is published as Open Access and its references are available on the publisher’s web site, this is not sufficient for the bulk harvesting and analysis of citation data §  Imagine the effort of going to each site in turn and scraping reference lists presented in a wide variety of differing formats and DTD markups! n  Many small scholarly publishers are not even members of Crossref n  But help is at hand: §  OASPA has a sponsored agreement with Crossref whereby its smaller members can join Crossref via OASPA, with OASPA covering the cost of a proportion of their DOIs
  • 10. How to open references using the Crossref Cited-by service n  The Crossref Cited-by service is a free service that helps publishers find out who is citing their articles n  Publishers submit article reference lists to Crossref along with other metadata n  However, the Crossref default is that these reference lists are closed, not OPEN! n  To open their article reference lists, a publisher needs to do one of two things: §  Either contact Crossref support and ask them to turn on reference distribution for all the DOI prefixes they manage §  Or, in the article metadata they submit to Crossref, set reference_distribution_opt to “any” for each DOI deposit where they want to make references openly available n  It’s that easy!!!
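The second option can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats reference_distribution_opt as an attribute on the element that holds the content item's metadata (here journal_article), and the fragment is a hypothetical, heavily abbreviated stand-in for a real deposit, which also carries namespaces and many more fields. The function name open_references is made up for this example. Check the current Crossref deposit schema before relying on the exact placement.

```python
import xml.etree.ElementTree as ET


def open_references(article_xml: str) -> str:
    """Return the article fragment with reference distribution set to "any"."""
    article = ET.fromstring(article_xml)
    # Assumption: the option is carried as an attribute on the content item element.
    article.set("reference_distribution_opt", "any")
    return ET.tostring(article, encoding="unicode")


# Hypothetical, abbreviated fragment; not a complete or valid Crossref deposit.
fragment = (
    '<journal_article publication_type="full_text">'
    "<titles><title>Example article</title></titles>"
    "</journal_article>"
)

opened = open_references(fragment)
print(opened)
```

A publisher's deposit tooling would apply the same change to every content item whose references should be open, then submit the deposit as usual.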
  • 11. ZooKeys use of Crossref open citation data
  • 12. The OpenCitations Corpus n  OpenCitations (http://opencitations.net) is a small infrastructure organization directed by myself and Silvio Peroni n  Its primary purpose is to host and develop the OpenCitations Corpus (OCC), a Linked Open Data repository of scholarly bibliographic citation data n  A founding member of I4OC, it is distinct and separate from that initiative n  The first OCC prototype was created at Oxford in 2011 with Jisc funding – see my 2013 COASP talk in Riga (http://zeeba.tv/the-open-citations-corpus/) n  A new instance of the OCC, based on our revised metadata schema, was created by Silvio Peroni and is now running at the University of Bologna n  It has been ingesting scholarly references continuously since early July 2016 n  OCC now provides the largest RDF collection of open citation data on the Web §  Currently holds references from ~240,000 citing bibliographic resources §  Provides >10 million citation links to over 5.5 million cited resources §  These data are freely available under a CC0 public domain waiver
  • 13. Source data - reference lists from PubMed Central n  At present, the ingested reference lists are obtained by processing the XML sources of papers in the Open Access subset of PubMed Central n  These are parsed to yield authors, titles, journal names, etc. §  We ask for the most recent papers first §  Thus, as citing papers, the OCC mainly includes articles published in 2016 and 2017 n  The identifiers of all the citing papers already processed are stored locally, so as not to request the same XML source twice n  We then call several external APIs, including Crossref and ORCID, to obtain additional metadata describing the citing and cited papers and their authors n  There are almost 1.7 million OA articles available in PubMed Central §  So far we have harvested 14% . . .
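The "do not request the same XML source twice" step can be sketched as follows. This is a minimal illustration, not the OpenCitations code: harvest_new and fake_fetch are hypothetical names, the set of processed identifiers is held in memory rather than stored on disk, and the fetch function is a stand-in for a real HTTP request.

```python
from typing import Callable, Iterable, List, Set


def harvest_new(ids: Iterable[str], seen: Set[str],
                fetch: Callable[[str], str]) -> List[str]:
    """Fetch XML only for identifiers not processed before, recording each one."""
    fetched = []
    for pmcid in ids:
        if pmcid in seen:
            continue  # already processed: never request the same source twice
        fetched.append(fetch(pmcid))
        seen.add(pmcid)
    return fetched


def fake_fetch(pmcid: str) -> str:
    # Stand-in for a real request to the XML source; returns a placeholder document.
    return f"<article id='{pmcid}'/>"


seen_ids = {"PMC4863913"}  # identifiers processed in an earlier run
docs = harvest_new(["PMC4863913", "PMC2686615", "PMC2686615"], seen_ids, fake_fetch)
print(len(docs), sorted(seen_ids))
```

As a rough check on the figures above, 14% of almost 1.7 million articles is about 238,000, which agrees with the ~240,000 citing bibliographic resources reported for the OCC.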
  • 14. The raw reference list data n  The reference lists extracted from citing papers are made available in JSON:
 {
 "doi": "10.1007/s11892-016-0752-4",
 "pmid": "27168063",
 "pmcid": "PMC4863913",
 "localid": "MED-27168063",
 "curator": "BEE EuropeanPubMedCentralProcessor",
 "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4863913/fullTextXML",
 "source_provider": "Europe PubMed Central",
 "references": [
 ...
 {
 "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome using a computational program and its relationship to autoreactive T cells, Int Immunol, 2009, 21, 6, 705, 13, DOI: 10.1093/intimm/dxp039, PMID: 19461125",
 "pmid": "19461125",
 "doi": "10.1093/intimm/dxp039",
 "pmcid": "PMC2686615",
 "process_entry": "True"
 },
 ...
 ]
 }
 Slide callouts: the top-level fields are the citing paper's metadata and identifiers; each entry in "references" is a reference in the citing paper's reference list, with its own ids
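To show how such a record turns into citation links, here is a small Python sketch. It is an illustration only: the record is abbreviated (the bibentry text is shortened and the elided references are left out), and citation_links is a hypothetical helper, not part of the OpenCitations software.

```python
import json

# Abbreviated copy of the record above; the bibentry is shortened for this example.
record_text = """
{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "references": [
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    }
  ]
}
"""


def citation_links(record: dict) -> list:
    """Return (citing DOI, cited DOI) pairs for references that carry a DOI."""
    citing = record["doi"]
    return [(citing, ref["doi"])
            for ref in record.get("references", [])
            if "doi" in ref]


record = json.loads(record_text)
links = citation_links(record)
print(links)
```

References without a DOI are skipped here; a fuller version would fall back on the PMID or PMCID, or on matching the bibentry text.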
  • 15. The SPAR (Semantic Publishing and Referencing) Ontologies n  OCC data are then stored in RDF (JSON-LD) using the SPAR (Semantic Publishing and Referencing) ontologies and other standard vocabularies n  These SPAR ontologies include: §  FaBiO, the FRBR-aligned Bibliographic Ontology - an ontology for describing bibliographic entities (books, articles, etc.) §  CiTO, the Citation Typing Ontology - enables the characterization of citations, both factually and rhetorically §  BiRO, the Bibliographic Reference Ontology - an ontology to define bibliographic records and references, and their compilation into bibliographic collections and reference lists, respectively n  http://www.sparontologies.net/
  • 16. Availability of the OpenCitations Corpus data n  All the OpenCitations software is available on GitHub under an open license n  The data in the OpenCitations Corpus are available in three different ways: §  Direct access to bibliographic resources by means of their HTTP URIs (via content negotiation), e.g. https://w3id.org/oc/corpus/br/1 §  Queries to our SPARQL endpoint: https://w3id.org/oc/sparql §  Monthly dumps stored in Figshare: http://opencitations.net/download n  Currently the OCC uses a good graph-based triplestore – Blazegraph n  However, the virtual machine that hosts it is very limited in resources, causing performance problems for demanding SPARQL queries n  We plan soon to commission a new powerful physical server that should provide a better user experience, and to develop additional user-friendly interfaces for accessing the OCC data, including graphic visualizations of citation networks
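A query against the SPARQL endpoint might be prepared like this. The sketch only builds the request URL and does not send it. It assumes that citation links are expressed with the CiTO property cito:cites and that the endpoint accepts the standard SPARQL protocol "query" parameter; cites_query and request_url are hypothetical helper names, and the endpoint address is the one given on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ENDPOINT = "https://w3id.org/oc/sparql"  # endpoint address as given on the slide


def cites_query(resource_uri: str, limit: int = 10) -> str:
    """Build a SPARQL query listing resources cited by the given resource."""
    return (
        "PREFIX cito: <http://purl.org/spar/cito/>\n"
        "SELECT ?cited WHERE { "
        f"<{resource_uri}> cito:cites ?cited "
        "} "
        f"LIMIT {limit}"
    )


def request_url(query: str) -> str:
    """Encode the query as a GET request URL for the endpoint."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = cites_query("https://w3id.org/oc/corpus/br/1")
url = request_url(query)
print(url)
```

A small LIMIT keeps the query light, which matters while the hosting virtual machine is short of resources.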
  • 17. Use of the OpenCitations web site n  Accesses to the OpenCitations web site and services: The “corpus” and “sparql” pages have together gained 89% of the total accesses, showing that people mainly access the OpenCitations Corpus to explore and use the data within it
  • 18. Use of OpenCitations data stored on Figshare
  • 19. What happened this summer? n  Use of the OpenCitations social accounts increased markedly following the launch of the Initiative for Open Citations §  Twitter - https://twitter.com/opencitations §  WordPress blog – https://opencitations.wordpress.com/
  • 20. Who is using OpenCitations, and for what? n  Organizations and projects that we know use OpenCitations resources include: §  Wikidata - pulling citation data to enrich their pages §  OpenAIRE – using information about OCC bibliographic resources §  LOC-DB - has adopted the OpenCitations data model for its database §  Tomas Petricek of the Turing Institute - extending his Gamma Project visualization software to handle OpenCitations’ RDF data §  Ontotext.com - combining Springer's SciGraph data with OpenCitations data using SPARQL federation §  Anna Kamińska of the Polish Librarians Association - undertaking citation network analysis of PLOS ONE research papers using data in the OCC n  We can’t know who else is using OpenCitations resources unless they tell us! §  Please let us know if you are! n  On 10th September, Crossref blogged about our use of their REST API §  https://www.crossref.org/blog/using-the-crossref-rest-api.-part-5-with-opencitations/
  • 21. Present status of OpenCitations n  We have recently received a small grant from the Sloan Foundation for the OpenCitations Enhancement Project §  This provides one year’s salary for a postdoc to develop new user interfaces, and new hardware to enhance the OCC performance n  We have just appointed Ivan Heibi to work on the OCC with Silvio in Bologna n  Silvio and Ivan will be commissioning the new hardware next month §  This will use parallel processing to increase ingest rate 30-fold n  We are in the process of appointing an International Advisory Board to guide the growth of OpenCitations
  • 22. Enhancing the OpenCitations ingestion rate n  OpenCitations currently ingests ~8 million new citations per year n  With 30 Raspberry Pis working in parallel as ingest machines, we anticipate that this rate will increase to ~240 million new citations per year n  By the end of 2018, OpenCitations should hold ~250 million citations, compared to Web of Knowledge’s ~1.25 billion n  Even this partial coverage will include citations of all important papers, these critical papers being easily recognized because they are highly cited, forming nodes in the citation graph with a large number of inward citation links n  A further five-fold increase in ingest rate - significant but achievable with additional hardware (and funding!) - will enable us to reach parity by 2020
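The arithmetic behind these projections can be checked with a short Python sketch. The input figures are the ones quoted on the slide; the parity estimate additionally assumes, purely for illustration, that the Web of Knowledge total stays fixed and that the five-fold faster rate applies from the start of 2019.

```python
current_rate = 8_000_000        # citations ingested per year today
parallel_machines = 30          # Raspberry Pis working in parallel

# 30 machines at the current per-machine rate
enhanced_rate = current_rate * parallel_machines

wok_total = 1_250_000_000       # Web of Knowledge citations (approximate)
projected_2018 = 250_000_000    # OCC holdings expected by the end of 2018

# How far behind the OCC would still be at the end of 2018
gap_factor = wok_total / projected_2018

# A further five-fold increase in ingest rate
further_rate = enhanced_rate * 5

# Assumption: WoK total held constant, faster rate in force from the start of 2019
years_to_parity = (wok_total - projected_2018) / further_rate

print(enhanced_rate, gap_factor, further_rate, round(years_to_parity, 2))
```

Under those assumptions the remaining one billion citations would take a little under a year to ingest at 1.2 billion per year, which is consistent with reaching parity by 2020.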
  • 23. Where will the references come from? n  With the enhanced ingest rate, we will quickly consume all 1.7 million articles in the Open Access Subset of PubMed Central n  We will then start harvesting the references from the ~16 million articles already made open at Crossref in response to the Initiative for Open Citations, and the additional articles that I4OC now encourages other publishers to open n  Possible additional significant sources of open citation data include §  arXiv (1.3 million preprints) §  CiteSeerX (>120 million references from >6 million documents) §  CitEc (11 million references from a million Economics papers) n  References from pre-digital publications extracted by text mining, e.g. §  In the Social Sciences, from the LOC-DB at the University of Mannheim §  In Biological Taxonomy, mined into BioStor by Rod Page from the Biodiversity Heritage Library, e.g. http://biostor.org/reference/105357
  • 24. We are winning the battle for open scholarship! n  OpenCitations: David Shotton, Silvio Peroni §  Website: http://opencitations.net §  Twitter: @opencitations §  Blog: https://opencitations.wordpress.com n  I4OC: Dario Taraborelli, Mark Patterson, Catriona MacCallum §  Website: https://i4oc.org/ §  Twitter: @i4oc_org