From Open Access to Open Standards, (Linked) Data and Collaborations

Simeon Warner https://orcid.org/0000-0002-7970-7855
(Director of IT for Library Linked Data and Repository
Architecture, Cornell University Library, USA)
National Library of Finland Kirjastoverkkopäivät (Library
Network Days), Helsinki, Finland, 2017-10-25

How?
•  This was xxx.lanl.gov, now known as
arXiv.org
•  I worked in a narrow field
•  Everyone posted to one place
•  It was a newish field
•  (I was perhaps happy to not read
widely enough)

arXiv submissions
https://arxiv.org/help/stats/2016_by_area/index
[Figure panels: new submission rate (color = subject); fraction of total rate for each subject area]

What have we learned?
•  Researchers are happy to use e-prints
•  E-print repositories can scale
•  Cost is low ($10-15/article)
•  Some moderation necessary
•  Not very disruptive to journal
publishing (in physics)
Demonstrates a substrate for article distribution
supporting overlay, but there has not been
significant adoption of the overlay model

All primary (scientific)
research outputs
should be openly
accessible

Why?
Because research will be
done more effectively if
all shoulders are
available to stand on

Preprint tipping point?
•  arXiv “next generation” funding from Sloan
and Heising-Simons foundations
•  bioRxiv funding from the Chan Zuckerberg Initiative
•  ASAPbio initiative funded by
Sloan, Moore, Arnold and Simons
foundations
•  ...

Open standards
for repository
data harvesting

Long long ago,
when XML was hard,
Unicode was merely one
possible character set,
a big hard drive was 10GB,
and HotBot & AltaVista
had a new competitor...

... it was 1999 and the UPS meeting in
Santa Fe aimed to
“... identify technologies to stimulate
the adoption of the concept of [Open
Access] author self-archived systems in
scholarly communication; theorize a
framework for the integration of e-print
services in the academic
document system ...”
https://www.openarchives.org/meetings/SantaFe1999/ups-invitation-ori.htm

Thus was born OAI-PMH
v1.0 2001,
v1.1 2002,
v2.0 2003

OAI-PMH was great!
•  It works
•  Scales to millions of items
•  Easy to implement (good s/w libraries)
•  XML, which brought UTF-8 for good
multi-language support (hurrah!)
•  Widely deployed, stable since 2003 (v2.0)
•  Registries & validators
•  Community & documentation

BASE harvests
>5000 sources
>112M documents

Technical deficiencies
•  Not RESTful
•  Repository-centric
•  XML metadata only
•  Metadata is wrapped
•  Dynamic set membership bug
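
To make "not RESTful" and "metadata is wrapped" concrete, here is a minimal Python sketch of what a harvest looks like from the client side. The verb and argument names (ListRecords, metadataPrefix, resumptionToken) come from the OAI-PMH v2.0 specification; the base URL, the sample response and the helper names are made up for illustration, and no network access is attempted.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request: every item sits behind a query on one base URL."""
    if resumption_token:
        # In the protocol, resumptionToken is an exclusive argument
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_page(xml_text):
    """Return (header identifiers, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iter(OAI_NS + "identifier")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token

# Hypothetical response page: the metadata is wrapped inside protocol XML
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2017-10-01</datestamp>
      </header>
      <metadata>...</metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE)
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

The harvester has to keep asking the same endpoint with the returned token until none comes back, which is the pull-based, repository-centric pattern listed above.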

"Currently, OAI-PMH is the only
behavior that is uniformly exposed by
most repositories.
[But], its focus on metadata, its pull-based
paradigm, and its technological
roots that date back to the web of the
nineties put it at odds with ... current
web technologies."
COAR Next Generation Repositories
http://comment.coar-repositories.org/2-next-generation-repositories/

Photo by drivethrucafe CC BY-SA
https://www.flickr.com/photos/128758398@N07/15836296662

Google Scholar
is great, but
not the answer

Replacement with no gap
We need a new approach that:
•  Meets existing OAI-PMH use cases
•  Supports content as well as metadata
•  Scales better
•  Follows web standards
•  Is modern and developer friendly

Push-me pull-you
many items / sources
low latency / efficiency
=> push/notification
modest size
low barrier
=> pull

ResourceSync
ANSI/NISO Z39.99-2017
Sitemaps +
•  multiple sets
•  fixity
•  links
•  changes only
•  dumps
Also supports Notifications (push) as
optional extension
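
As a rough illustration of "Sitemaps +", the following Python sketch writes a small ResourceSync Resource List: an ordinary sitemap with rs:md elements carrying the capability name and per-resource fixity. The namespaces and the capability, at, hash and length attributes follow the ResourceSync specification as I understand it; the function name, the example URI and the content bytes are assumptions for the example only.

```python
import hashlib
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)

def resource_list(resources, at):
    """Serialize (uri, lastmod, content_bytes) tuples as a Resource List."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    # Document-level metadata: which capability this sitemap represents
    ET.SubElement(urlset, f"{{{RS_NS}}}md",
                  {"capability": "resourcelist", "at": at})
    for uri, lastmod, content in resources:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
        # Per-resource fixity: digest and length of the content itself
        ET.SubElement(url, f"{{{RS_NS}}}md", {
            "hash": "md5:" + hashlib.md5(content).hexdigest(),
            "length": str(len(content)),
        })
    return ET.tostring(urlset, encoding="unicode")

xml_text = resource_list(
    [("https://example.org/item/1", "2017-10-25T00:00:00Z", b"hello")],
    at="2017-10-25T00:00:00Z",
)
```

Because each entry points at a plain web resource, the same list can describe content files as well as metadata records.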

CORE
>6000 journals
>2400 repositories
>77M articles
(>6M full text)
metadata +
content

Slide from Petr Knoth / CORE – DPLAfest 2017 presentation -- https://goo.gl/vz3zuJ
Tested with
resync client. 20
x 25MB sitemaps,
1M items ✔

Repository
Harvesting
Conclusions
The repository
community should
agree on a
common new
approach to
harvesting
ResourceSync was
designed to meet

Repository prescription
•  Metadata and content should be web
resources
o  stable URIs, follow web standards, not hidden
behind query interfaces
•  Support ResourceSync as the primary
harvesting interface
o  see e.g.
http://hydrainabox.projecthydra.org/2017/06/22/resourcesync.html
o  OAI-PMH as secondary where necessary
•  Distinguish and relate metadata and content
entries

Some of my person ids
http://orcid.org/0000-0002-7970-7855
http://www.isni.org/isni/0000000351311901
http://www.researcherid.com/rid/E-2423-2011
https://www.scopus.com/authid/detail.uri?
authorId=7103063073
https://arxiv.org/a/warner_s_1
http://vivo.cornell.edu/display/individual24416
https://github.com/zimeon
http://zimeon.com/me

Scopes and scales
ORCID   Scope: 8-20M active, +2-4M/year ?   Now: 3.2M
ISNI    Scope: ?M                           Now: 9M
VIAF    Scope: ?M                           Now: 6M

Why must ORCID be different?
How many people should have ORCID iDs?
o  UNESCO 2013 estimate: 7.8 million researchers
o  OECD 2014 estimate: 25.5 million researchers
o  Average “active lifetime” 3-6 years (guess)
o  Far more than person records in authority systems
How many research and scholarship outputs
should be connected to these ORCID iDs?
o  ~2 million journal articles published per year
(https://arxiv.org/abs/1402.4578)
o  + >> more if notions of scholarly output extend to
data, code, specimens
=>  “Sort it all out after the fact with manual effort”
solution not practical
=>  Solve with researcher engagement and use in
publication workflows

ORCID: Open Researcher
and Contributor ID
“ORCID’s vision is a world where all who participate in
research, scholarship, and innovation are uniquely
identified and connected to their contributions across
disciplines, borders, and time.”
“ORCID provides an identifier for individuals to use with
their name as they engage in research, scholarship, and
innovation activities. We provide open tools that enable
transparent and trustworthy connections between
researchers, their contributions, and affiliations. We
provide this service to help people find information and
to simplify reporting and analysis.” (https://orcid.org/)
=>  Research and scholarship focus
=>  Expect use by individuals identified in workflows

Contributor-Output graph
[Figure: contributor nodes C1-C3 and output nodes O1-O5, joined by “Contributed-to” and “Cites” edges]
Generalize:
o  many contributor roles
o  expand “cites” to include other notions of
derivation
o  ++ add organization nodes for affiliation/funding/
etc. (and time dependence)
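
One simple way to hold such a graph is as a list of typed edges. The Python sketch below uses the node labels from the figure, but the specific connections are invented for illustration, as are the function names. It answers one sample question: whose work does a given contributor build on directly?

```python
from collections import defaultdict

# Hypothetical edges: labels follow the slide, the connections are made up
EDGES = [
    ("C1", "contributed-to", "O1"),
    ("C1", "contributed-to", "O2"),
    ("C2", "contributed-to", "O2"),
    ("C3", "contributed-to", "O4"),
    ("O2", "cites", "O1"),
    ("O4", "cites", "O2"),
]

def index(edges):
    """Map subject -> relation -> set of objects."""
    out = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in edges:
        out[subject][relation].add(obj)
    return out

def contributors_cited_by(edges, contributor):
    """Other contributors whose outputs are directly cited by this contributor's outputs."""
    idx = index(edges)
    authors_of = defaultdict(set)
    for subject, relation, obj in edges:
        if relation == "contributed-to":
            authors_of[obj].add(subject)
    cited = set()
    for output in idx[contributor]["contributed-to"]:
        for target in idx[output]["cites"]:
            cited |= authors_of[target]
    return cited - {contributor}

builds_on = contributors_cited_by(EDGES, "C3")
```

Adding contributor roles or other kinds of derivation only means adding more relation names; organization nodes would be further subjects and objects in the same edge list.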

For full benefit ORCID
needs most researchers
to willingly use their
ORCID iD.

Links to other identities
– leverage overlaps
Biography and
information shown
under my control
... sources indicated
Researcher control
Researcher can choose
what appears on their
record

ORCID iD use
•  7000 journals use ORCID iDs, over
1500 of which require use by
corresponding authors
•  Researcher support from surveys:
o  In 2017 85.9% of respondents now believe
requiring the use of ORCID iDs is
beneficial to the global research
community, compared with 72.2% of 2015
respondents
o  In 2017 83.1% of respondents strongly
agree/agree that ORCID is “essential”,
compared with 48.8% in 2015.

ORCID community
Over 700
members
from 41
countries
https://orcid.org/statistics
3.9m researcher records,
1.5m records with at
least one connection:
24m works, 339K grants, 151K
reviews, 1.9m education and
1.5m employment items
More than 550 integrations
across all sectors of the
research community
Consortia in the UK, Denmark,
Finland, Sweden, Netherlands,
Belgium, Germany, Italy, South Africa,
Taiwan, Australia, New Zealand,
Canada and the US

ORCID Stakeholders, Actions and
Benefits

Journal article round trip
[Diagram labels: ORCID, Author(s), Manuscript submission, Review, Reviewers, Publication with ORCID, Readers, Automated record update - work]
ORCID iDs are intended to be integrated into
research and publication workflows, and become
embedded in metadata. Thus ORCID iDs are
associated with works when published
=>  Ambiguity avoidance rather than disambiguation!

Not (quite) the
semantic web
“it is clearly a good idea, and some very
nice demonstrations exist, but it has
not yet changed the world”
[out of context quote from “The Semantic Web”, Berners-Lee,
Hendler and Lassila, Scientific American, May 17, 2001]

Linked Data
•  A practical
“semantic web lite”
•  Narrower focus
(“RDF standards” such
as ontologies, SPARQL,
etc. are the gateway to
a more complete
semantic web.)
https://www.w3.org/DesignIssues/LinkedData.html

Why replace
MARC with
Linked Data
formats?

1. MARC is inadequate
MARC continues to meet many needs,
but there are several areas of stress:
•  Translation of record, not descriptions
of appropriate entities
•  Use of text when we want data
•  Limited extensibility
•  Imprecise URI references (record or
RWO?)
•  ...

2. Use identifiers not names
Identifiers provide necessary layer of
indirection that authorized names do
not:
•  Identifiers more easily stable
o  e.g. no change from “Banks, Iain, 1953-” to
“Banks, Iain, 1953-2013”
•  Exact matching
•  URIs make the web work well
•  Does not replace authority ideas, just
makes them work better
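
The indirection can be shown in a few lines of Python. This is only a toy model: the agent URI, the record titles and the function name are invented, and the two headings are the ones from the example above.

```python
# Authority data keyed by identifier (the URI is a made-up example)
AGENT = "http://example.org/agent/banks"
authority = {AGENT: "Banks, Iain, 1953-"}

# The same bibliographic record, linked two ways
record_by_name = {"title": "Book A", "creator": "Banks, Iain, 1953-"}
record_by_uri = {"title": "Book A", "creator": AGENT}

def display_name(record):
    """Resolve an identifier through the authority; plain strings pass through."""
    return authority.get(record["creator"], record["creator"])

# The authorized heading changes once, in one place
authority[AGENT] = "Banks, Iain, 1953-2013"

name_after = display_name(record_by_name)  # still the old string in the record
uri_after = display_name(record_by_uri)    # picks up the new heading
```

The record that stores the name string now needs its own update, while the record that stores the identifier needs none, and matching on the identifier stays exact throughout.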

3. Connect to the web
“Fortress MARC”
protects and
isolates libraries
from the web
•  Little reuse of
our data
•  Can’t use
standard tools
•  Difficult to
generalize
[Figure: “Libraries” enclosed by a wall built from the letters M, A, R, C]

Libraries
Web
The web is big ...
... most of
our users
spend most
of their time
there
[not to scale]

BIBFRAME & related ontologies
[Timeline figure labels: BIBFRAME1.0, BIBFRAME2.0, BIBFRAME3.x?, BIBFRAME4.x?; NOW; LD4L critique; Community adoption & revision?; Extensions: bflc extension, bibliotek-o, …others…, ArtFrame, RareMat, ???; axis: Time]

LD4L & LD4L Labs
Cornell, Harvard, Stanford, Iowa; 2014-2016
•  Conversion of MARC -> BIBFRAME at scale (~30M
records, ~3 billion triples)
•  Blacklight-based search over combined catalogs
•  Ontology work around “LD4L ontology” which
provided significant input for BIBFRAME2.0
•  Support use of linked data authorities in the Hydra
stack via Questioning Authority gem
2016-2018
•  bibliotek-o ontology
•  Data conversion MARC & non-MARC to LD
•  VitroLib editor
•  Authority infrastructure and UI refinement including
context
https://ld4l.org/ld4l-labs/

LD4P – ... for Production
Columbia, Cornell, Harvard, LC, Princeton, Stanford –
2016-2018
•  Develop extension ontologies for
BIBFRAME2.0/bibliotek-o (ArtFrame,
Cartographic, Moving Image, Performed
Music, & Rare Materials)
•  Pilot transition of technical services
workflows to a linked data environment
o  copy cataloging
o  original cataloging
(“production” in LD4P means creation of catalog
records, not production-ready)
https://ld4l.org/ld4p

BIBFLOW (UC Davis, 2014-2016)
https://goo.gl/vwUiJY
Conservative
suggestion:
•  add URIs first
•  establish 2-way
conversions for
import/export

National Library of Finland
•  MARC to BIBFRAME to schema.org
•  Focus on web publication, hence
schema.org
http://swib.org/swib16/slides/suominen_silos.pdf

How close are we
to linked data
catalogs?

Let’s not forget utility
“Catalogers are primarily concerned
about the quality and consistency of the
data they produce, while technologists
are primarily concerned with the
techniques and tools that can be used to
manipulate it.”
[Jeff Edmunds,
https://scholarsphere.psu.edu/concern/generic_works/44558d45t ]

Linked Data catalog ecosystem
[Diagram labels in extraction order; the connections between boxes are not recoverable:
Discovery system; ILS (bib, holdings, auth, circ); MARC to LD; Data sharing between libraries; circ; LD cooperative and vendor sources; Browse and explore with context; Reconciliation; Lookup tools (with reconciliation); Local LD authorities; LD editors; LC marc2bibframe; LD4L Labs bib2lod; Blacklight with LD extensions; LD4L Labs VitroLib, LC BFEdit, CEDAR; Vitro / Triplestore; Non-library web data sources; Manual, automated and semi-supervised reconciliation tools & practices; Web-based context: Wikidata, DBpedia, etc.; Web-scale search; Analysis and validation; W3C SHACL; LD4L Labs validation; OCLC schema, LC pilots; schema.org; Authorities with LD descriptions; id.loc.gov, LC FAST, VIAF, ORCID, Getty, etc…; context data; users]

Catalog system feedback cycles
[Diagram labels: Data modeling & profile creation; Tool building; Cataloging and conversion; Data use (discovery); End user evaluation; Community review and discussion (shown four times)]

Open
Collaborations
(around software)

“Over The Wall”
•  Simply make a copy of the source
code available
•  Exemplified by many uses of
SourceForge (though has more
features)
•  Sharing but not collaboration
... better than not sharing

Open Development
•  and related: “Social Coding”
•  Share changes as they are made and
provide means of contact/input
•  Exemplified by basic use of GitHub
(other services too)
•  License for re-use
better than
“Over The Wall”

Community Development
•  aka “Community Source
Software”
•  Multiple parties working
together toward shared
goals
•  Norms
•  Coordination
•  Governance
https://commons.wikimedia.org/wiki/File:Tux.svg
https://commons.wikimedia.org/wiki/File:Apache_Software_Foundation_Logo_(2016).svg
Apache 2.0 License
Home in
Helsinki !

Samvera (formerly Hydra)
•  Framework and “solution bundles” for
repository and DAM systems
•  Blacklight/Solr + Fedora + Ruby
•  30+ partner institutions
•  Vibrant and supportive community
•  Yearly conference and other meetings
•  Training
•  Currently considering stronger
governance options
https://samvera.org/

International Image
Interoperability Framework
“A community of the world’s leading libraries
and image repositories working to produce a
community framework and interoperable
technology for image delivery.”
•  Primary outputs are specifications, software
developed by sub-groups
•  IIIF Consortium formed in 2015 to support
growth and adoption
o  > 40 members, growing rapidly
o  Memberships pay for staff (2)
o  Libraries, museums, galleries, vendors
http://iiif.io/

Final thoughts
Most of the interesting big challenges
require collaboration to realize,
including the ones I’ve mentioned:
•  opening access to scholarly literature,
making it discoverable, and linking
researchers to their contributions
•  moving to the next generation of
library catalogs better integrated with
the web

Kiitos!
@zimeon
simeon.warner@cornell.edu
