Seamless access to the world’s open access research papers via ResourceSync
Petr Knoth
Use Case 1: ResourceSync as a seamless layer over heterogeneous APIs
Use Case 1: What is CORE?
[Diagram: OA Repositories; OA Journals; Mostly OAI-PMH]
CORE aggregates and provides free access to millions of research articles from thousands of OA repositories and journals.
»Enrichment and harmonisation of aggregated data
»Products/services:
›Portal
›API
›Data dumps
›Recommendation system for libraries
›Repository dashboard
›B2B and analytical services
»70 million+ metadata records
»Over 6 million full texts hosted on CORE
»~1.5 million monthly active users
»Aggregating from 2,500 repositories and 10k OA journals
Use Case 1: Key issue
Key players do not provide interoperability for machine access to metadata and content of research papers.
[Figure: six charts on machine access to full text. Panel titles and category labels as extracted:
Accessing full-text by harvesting the website: Major search engines; Recognised services upon approval
Restricting access to full-text: Don't restrict access in any way; Specify a crawl delay; Allow access to specific robots
Reference of an article’s full-text on metadata: Direct link to full-text; Interface supporting full-text transfer
Accessing content standards: OAI; Own API; Z39.50
Files format: PDF; HTML; Plain text; HTML; JSON
Automated downloads of OA full-text: Website; API; FTP
Percentages in extraction order (pairing with categories not recoverable): 35, 23, 18, 12, 12 | 75, 12, 13 | 39, 11, 39, 11 | 50, 42, 8 | 36, 24, 4, 32, 4 | 54, 31, 15]
Use Case 1: Approach
[Diagram: OA Repositories; OA Journals; Key publishers (OA + hybrid OA); Publisher connector; Mostly OAI-PMH; A range of bespoke APIs; + many others]
Provide seamless access over non-standardised APIs.
What protocol?
»Why not OAI-PMH?
›Slow and very inefficient for big repositories.
›Standardised for metadata transfer but not for content transfer.
›Very difficult to represent the richness of metadata from a broad range of data providers.
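A rough back-of-the-envelope sketch of the first point. The slide does not say why OAI-PMH is slow; one commonly cited reason, assumed here, is that a list request is paged with a resumptionToken, so each page can only be requested after the previous response has arrived. The record count, page size and per-request latency below are hypothetical illustration values, not CORE measurements.

```python
def sequential_requests(total_records: int, page_size: int) -> int:
    """Number of paged requests needed when each page depends on the
    resumptionToken returned by the previous one (no parallelism)."""
    if total_records < 0 or page_size <= 0:
        raise ValueError("total_records must be >= 0 and page_size > 0")
    # Integer ceiling division avoids floating-point rounding issues.
    return (total_records + page_size - 1) // page_size


def harvest_hours(total_records: int, page_size: int,
                  seconds_per_request: float) -> float:
    """Lower bound on wall-clock time for a strictly sequential harvest."""
    requests_needed = sequential_requests(total_records, page_size)
    return requests_needed * seconds_per_request / 3600.0


if __name__ == "__main__":
    # Hypothetical repository: 6 million records, 100 per page, 2 s per page.
    print(sequential_requests(6_000_000, 100))
    print(round(harvest_hours(6_000_000, 100, 2.0), 1))
```

Under these assumed numbers a full harvest is tens of thousands of dependent round trips, and this covers metadata only, since content transfer is outside what OAI-PMH standardises.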
Use Case 1: ResourceSync as a seamless access layer
»Very scalable implementation on both the server and client side
»Interpretation of metadata happens using the existing pipeline at the aggregator.
»1.5 million OA publications from Elsevier, Springer and others already exposed.
»Available at: https://publisher-connector.core.ac.uk/resourcesync
[Diagram: as on the Approach slide, with ResourceSync added]
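To make the client side concrete, here is a minimal sketch of reading a ResourceSync Resource List, which is a Sitemap document extended with `rs:md` elements. The sample document, the `example.org` URIs and the function name `parse_resource_list` are illustrative assumptions; this is not the Publisher connector's actual output or CORE's client code.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Sitemaps and by the ResourceSync extension elements.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical Resource List with two resources (a PDF and its metadata).
SAMPLE_RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-01-01T00:00:00Z"/>
  <url>
    <loc>https://example.org/papers/1.pdf</loc>
    <lastmod>2016-12-30T10:00:00Z</lastmod>
    <rs:md type="application/pdf" length="1024"/>
  </url>
  <url>
    <loc>https://example.org/papers/1.xml</loc>
    <lastmod>2016-12-30T10:00:00Z</lastmod>
    <rs:md type="application/xml" length="512"/>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return one dict per resource listed in a Resource List document."""
    root = ET.fromstring(xml_text)
    top_md = root.find("rs:md", NS)
    if top_md is None or top_md.get("capability") != "resourcelist":
        raise ValueError("document is not a ResourceSync Resource List")

    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        length = md.get("length") if md is not None else None
        resources.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "type": md.get("type") if md is not None else None,
            "length": int(length) if length is not None else None,
        })
    return resources


if __name__ == "__main__":
    for resource in parse_resource_list(SAMPLE_RESOURCE_LIST):
        print(resource["uri"], resource["type"], resource["length"])
```

The client only needs the URI and a little transfer metadata per resource; interpreting the publisher-specific metadata is left to the aggregator's existing pipeline, as the slide notes.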
Use Case 2: Exposing enriched data for Text and Data Mining (TDM) via ResourceSync
Use Case 2: Subscribing to ResourceSync
[Diagram: as on the Approach slide, with ResourceSync added]
»Other aggregators can subscribe to the Publisher connector to make use of their own ingestion pipelines and enrichment technologies
Use Case 2: Content ingestion in OpenMinTeD
[Diagram: as on the Approach slide, with ResourceSync and OMTD-SHARE (over REST) added]
»CORE and OpenAIRE are content sources in the OpenMinTeD TDM platform (EU infrastructure project), which is being developed to enable the mining of scholarly literature.
Use Case 2: Exposing enriched data for TDM
[Diagram: as on the Approach slide, with ResourceSync shown twice]
»But others want similar solutions … typically, they want to be able to sync and host the data.
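A minimal sketch of what "sync and host" could look like for a subscriber: after a baseline copy, the client reads a Change List and applies created, updated and deleted entries to its local mirror. The sample document, the `example.org` URIs, the in-memory `mirror` dict and the function name `apply_change_list` are all assumptions for illustration; the layout loosely follows the ResourceSync Change List format.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical Change List covering one day of changes.
SAMPLE_CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2017-01-01T00:00:00Z" until="2017-01-02T00:00:00Z"/>
  <url>
    <loc>https://example.org/papers/2.pdf</loc>
    <lastmod>2017-01-01T09:00:00Z</lastmod>
    <rs:md change="created"/>
  </url>
  <url>
    <loc>https://example.org/papers/1.pdf</loc>
    <lastmod>2017-01-01T12:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>https://example.org/papers/3.pdf</loc>
    <lastmod>2017-01-01T15:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>"""


def apply_change_list(mirror, xml_text):
    """Apply changes to `mirror` (a dict of uri -> lastmod) in place.

    Returns the URIs whose content must be downloaded again.
    """
    root = ET.fromstring(xml_text)
    to_fetch = []
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        if change == "deleted":
            mirror.pop(uri, None)
        elif change in ("created", "updated"):
            # Skip entries already mirrored at this timestamp.
            if mirror.get(uri) != lastmod:
                mirror[uri] = lastmod
                to_fetch.append(uri)
    return to_fetch


if __name__ == "__main__":
    mirror = {
        "https://example.org/papers/1.pdf": "2016-12-30T10:00:00Z",
        "https://example.org/papers/3.pdf": "2016-12-30T10:00:00Z",
    }
    print(apply_change_list(mirror, SAMPLE_CHANGE_LIST))
    print(sorted(mirror))
```

Only the changed resources are transferred, which is what keeps a hosted copy fresh without re-harvesting everything.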
Use Case 3: Make repositories and journals adopt ResourceSync
Use Case 3: Replace OAI-PMH with ResourceSync
[Diagram: as on the Approach slide, with OMTD-SHARE (over REST) added and ResourceSync shown three times]
»Will be a game changer …
»Advocated by COAR Next Generation Repositories WG
Key contributions and considerations
What’s new about our implementation of ResourceSync?
»Scales to many millions of resources, as required by aggregators (existing implementations for repositories scale to tens of thousands of resources)
»Real-time updating of ResourceLists and ChangeLists (avoiding unnecessary batch processes).
»Combination of real-time updates and scalability
Architectural choices
»Based on the principle of changes being communicated to a controller as they happen (rather than having to be detected prior to ResourceList/ChangeList updates)
»Uses Elasticsearch as a database
»A hashing mechanism to even out the size of each linked ResourceList, plus a clever mechanism for iteratively updating ResourceLists
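The two ideas above (push-style change notification and hash-based distribution of resources across lists) can be sketched as follows. This is an illustrative toy under stated assumptions, not CORE's implementation: it keeps state in memory rather than in Elasticsearch, and the class name `SyncController`, the method names and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timezone


class SyncController:
    """Receives change notifications and keeps per-bucket resource lists
    and a change log up to date incrementally (no batch re-scan)."""

    def __init__(self, num_buckets=4):
        self.num_buckets = num_buckets
        self.buckets = defaultdict(dict)   # bucket id -> {uri: lastmod}
        self.change_log = []               # (timestamp, uri, change)

    def bucket_for(self, uri):
        """Map a URI to a stable bucket so each list stays a similar size."""
        digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_buckets

    def notify(self, uri, change, when=None):
        """Called by the ingestion side at the moment a resource changes."""
        if when is None:
            when = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        bucket = self.buckets[self.bucket_for(uri)]
        if change == "deleted":
            bucket.pop(uri, None)
        elif change in ("created", "updated"):
            bucket[uri] = when
        else:
            raise ValueError(f"unknown change type: {change!r}")
        # Only the affected bucket and the change log are touched.
        self.change_log.append((when, uri, change))


if __name__ == "__main__":
    controller = SyncController(num_buckets=8)
    for i in range(1000):
        controller.notify(f"https://example.org/papers/{i}.pdf", "created")
    print(sorted(len(b) for b in controller.buckets.values()))
```

Because a URI always hashes to the same bucket, an update rewrites one small list instead of regenerating a single list of millions of entries, which is one plausible way to combine real-time updates with scale.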
Conclusions
»ResourceSync:
›Broad range of uses in scholarly communication.
›Solves problems with aggregating content over OAI-PMH: faster and more efficient aggregation means fresher data in aggregators.
»We used ResourceSync to “liberate” over 1.5 million OA papers (and growing) from key publishers
»CORE soon to provide access to over 8 million OA full texts via ResourceSync.
»CORE actively contributes to the adoption of ResourceSync in the repositories community (as part of OpenMinTeD and COAR NGR)
