CSC – Finnish ICT competence centre for research, education, culture and public administration
Research Data Management, Challenges and Tools
Per Öster, CSC – IT Center for Science Ltd
• Drivers
o Number of devices
o Number of communicating apps
o Number of users
2.6.2017
10% of UK power consumption due to ICT: 1/3 network, 1/3 devices, 1/3 datacentres
Amount of data
Research cycle: Conceptualisation, Data gathering, Analysis, Publication, Review
Open science practices: Open access, Scientific blogs, Collaborative bibliographies, Alternative reputation systems, Citizen science, Open code, Open workflows, Open annotation, Open data, Pre-print, Data-intensive
Services: Scistarter.com, Runmycode.org, ArXiv, Roar.eprints.org, Impact Story, Altmetric.com, Mendeley.com, Academia.edu, Researchgate.com, Openannotation.org, Datadryad.org, Myexperiment.org, Figshare.com
An emerging ecosystem of services and standards
It's real!
The DCC Curation Lifecycle Model
www.dcc.ac.uk
info@dcc.ac.uk

The Curation Lifecycle
The DCC Curation Lifecycle Model provides a graphical high level overview of the stages required for successful curation and preservation of data from initial conceptualisation or receipt. The model can be used to plan activities within an organisation or consortium to ensure that all necessary stages are undertaken, each in the correct sequence. The model enables granular functionality to be mapped against it; to define roles and responsibilities, and build a framework of standards and technologies to implement. It can help with the process of identifying additional steps which may be required, or actions which are not required by certain situations or disciplines, and ensuring that processes and policies are adequately documented.

Data (Digital Objects or Databases)
Data, any information in binary digital form, is at the centre of the Curation Lifecycle. This includes:
Digital Objects
- Simple Digital Objects are discrete digital items; such as textual files, images or sound files, along with their related identifiers and metadata.
- Complex Digital Objects are discrete digital objects, made by combining a number of other digital objects, such as websites.
Databases
- Structured collections of records or data stored in a computer system.

Full Lifecycle Actions
Description and Representation Information: Assign administrative, descriptive, technical, structural and preservation metadata, using appropriate standards, to ensure adequate description and control over the long-term. Collect and assign representation information required to understand and render both the digital material and the associated metadata.
Preservation Planning: Plan for preservation throughout the curation lifecycle of digital material. This would include plans for management and administration of all curation lifecycle actions.
Community Watch and Participation: Maintain a watch on appropriate community activities, and participate in the development of shared standards, tools and suitable software.
Curate and Preserve: Be aware of, and undertake management and administrative actions planned to promote curation and preservation throughout the curation lifecycle.

Sequential Actions
Conceptualise: Conceive and plan the creation of data, including capture method and storage options.
Create or Receive: Create data including administrative, descriptive, structural and technical metadata. Preservation metadata may also be added at the time of creation. Receive data, in accordance with documented collecting policies, from data creators, other archives, repositories or data centres, and if required assign appropriate metadata.
Appraise and Select: Evaluate data and select for long-term curation and preservation. Adhere to documented guidance, policies or legal requirements.
Ingest: Transfer data to an archive, repository, data centre or other custodian. Adhere to documented guidance, policies or legal requirements.
Preservation Action: Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Actions include data cleaning, validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats.
Store: Store the data in a secure manner adhering to relevant standards.
Access, Use and Reuse: Ensure that data is accessible to both designated users and reusers, on a day-to-day basis. This may be in the form of publicly available published information. Robust access controls and authentication procedures may be applicable.
Transform: Create new data from the original, for example
- By migration into a different format.
- By creating a subset, by selection or query, to create newly derived results, perhaps for publication.

Occasional Actions
Dispose: Dispose of data, which has not been selected for long-term curation and preservation in accordance with documented policies, guidance or legal requirements. Typically data may be transferred to another archive, repository, data centre or other custodian. In some instances data is destroyed. The data’s nature may, for legal reasons, necessitate secure destruction.
Reappraise: Return data which fails validation procedures for further appraisal and reselection.
Migrate: Migrate data to a different format. This may be done to accord with the storage environment or to ensure the data’s immunity from hardware or software obsolescence.
Secure Compute Clouds
Supporting sample logistics
Services and Coordination
• Federated Authentication
• Authorization
• Dataset registry
• Data transfer hub
• Policy and Legal Framework
High speed encrypted data transfer (GridFTP/Globus/Aspera)
Secure data access remote API (GA4GH)
Data Generation: Sequencing centers
Data Archiving: EGA
Bringing users to data: Data Users
Managing Access: Data Owner, Data Access Agreement, Data Access Committee, Data Request
Authorization Management Tools (EGA and CSC REMS)
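The access-management elements named on this slide (data request, data access agreement, data access committee, authorization) can be illustrated with a small sketch. This is only a toy model under stated assumptions: the class names, fields and approval rule are hypothetical and do not reflect the actual EGA or CSC REMS interfaces or procedures.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class DataRequest:
    """A user's request for one dataset (hypothetical structure)."""
    user: str
    dataset: str
    agreement_accepted: bool  # has the user accepted the Data Access Agreement?
    status: Status = Status.SUBMITTED


@dataclass
class DataAccessCommittee:
    """Decides on requests for the datasets it is responsible for."""
    datasets: set
    entitlements: set = field(default_factory=set)

    def review(self, request: DataRequest, approve: bool) -> Status:
        if request.dataset not in self.datasets:
            raise ValueError("dataset is not managed by this committee")
        # Assumed rule: approval needs both the committee's decision
        # and an accepted Data Access Agreement.
        if approve and request.agreement_accepted:
            request.status = Status.APPROVED
            self.entitlements.add((request.user, request.dataset))
        else:
            request.status = Status.REJECTED
        return request.status

    def is_authorized(self, user: str, dataset: str) -> bool:
        return (user, dataset) in self.entitlements


dac = DataAccessCommittee(datasets={"dataset-001"})
granted = DataRequest("alice", "dataset-001", agreement_accepted=True)
refused = DataRequest("bob", "dataset-001", agreement_accepted=False)
dac.review(granted, approve=True)
dac.review(refused, approve=True)
```

In this sketch the entitlement set plays the role that an authorization management tool would play in a real deployment: downstream services only ask whether a (user, dataset) pair is authorized.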
From field measurements to open data
Questions:
• Sensitive data?
• Requirements on authentication and authorization?
SMEAR data flow
Instrument (A/D conversion, unit conversion)
Measuring PCs: Raw data at the stations
File servers at stations: Raw data and field diaries, cal documents
File servers in Helsinki: Raw and intermediate data, documents, scripts
SMEAR database: Processed data in Helsinki
ICOS, EBAS,... databases: Near real time and processed data outside UH
IDA (CSC data service): Raw data & document archive, database datasets
Routine data processing =
(- unit conversion)
- calibration correction
- quality check, gapfilling
- averaging over space or time
Other elements: Field documentation; Researchers, Data processing server; Feedback on data quality; Metadata
https://avaa.tdata.fi/web/smart/smear
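The routine data processing steps listed for the SMEAR data flow (calibration correction, quality check, gapfilling, averaging over time) can be sketched as a small pipeline. This is a minimal illustration, not the SMEAR implementation: the linear calibration, the range-based quality check, the linear-interpolation gapfilling and all function and parameter names are assumptions made for the example.

```python
from statistics import mean
from typing import List, Optional


def calibrate(raw: float, gain: float, offset: float) -> float:
    """Calibration correction, assumed here to be linear."""
    return gain * raw + offset


def quality_check(value: float, lo: float, hi: float) -> Optional[float]:
    """Flag values outside a plausible range as missing (None)."""
    return value if lo <= value <= hi else None


def gapfill(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps by linear interpolation; edge gaps stay None."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled


def block_average(series: List[Optional[float]], window: int) -> List[Optional[float]]:
    """Average over consecutive time windows, ignoring missing values."""
    out = []
    for start in range(0, len(series), window):
        chunk = [v for v in series[start:start + window] if v is not None]
        out.append(mean(chunk) if chunk else None)
    return out


def process(raw, gain, offset, lo, hi, window):
    calibrated = [calibrate(r, gain, offset) for r in raw]
    checked = [quality_check(v, lo, hi) for v in calibrated]
    return block_average(gapfill(checked), window)


# The third raw value is an outlier: it is rejected, then interpolated.
result = process([10, 12, 999, 16], gain=0.5, offset=1.0, lo=0, hi=50, window=2)
print(result)  # [6.5, 8.5]
```

Keeping each step as a separate function mirrors the slide's list and makes it straightforward to record which steps were applied as processing metadata.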
Support in All Phases of Research Process
Plan: Customer Portal, Experts, Guides, Websites, Training, Service Desk
Produce & Collect Data: International resources, Modelling, Software, Supercomputers
Analyse: Cloud Services, Training, Data science, Computing, Software
Store: B2SAFE, B2SHARE, HPC Archive, IDA, Databases, Research long-term preservation (LTP)
Share & Publish: AVAA, B2DROP, B2SHARE, Databank, Etsin, Funet FileSender
Termination. Company may terminate your access to all or any part of the Service at any time, with or without cause, with or without notice, effective immediately, which may result in the forfeiture and destruction of all information associated with your account, including User Submissions. If you wish to terminate your account, you may do so by following instructions available on the Site. Any fees paid hereunder are non-refundable. All provisions of the Terms of Use which by their nature should survive termination shall survive termination, including, without limitation, ownership provisions, warranty disclaimers, indemnity and limitations of liability.
ARTICLE 2: DISCLAIMER
1. The Service is provided "as is" and the Provider disclaims any and all
representations and warranties, whether express or implied, including;- but
not limited to;- implied warranties of title, merchantability, fitness for any
particular purpose or non-infringement. The Provider does not promise any
specific results, effects or outcome from the use of the Service.
2. …
3. The Provider reserves the right to change, reduce, interrupt or discontinue
the Service or parts of it at any time.
4. No one has a right to use the Service; the Provider reserves the right to
exclude certain Users.
Are the commercial services sufficient?
• Nice complement, but cannot serve as the fundamental infrastructure for research data of national and international interest
• Need for publicly funded and operated infrastructure
e-Science Data Factory
EUDAT CDI
“The EUDAT Collaborative Data Infrastructure is a defined data model and a set of technical standards and policies adopted by European research data centres and community data repositories to create a single European e-infrastructure of interoperable data services.”
“To date, over 20 major European research organizations, data and computing centres have signed an agreement to sustain the EUDAT – pan European collaborative data infrastructure for the next 10 years giving the birth to the EUDAT Collaborative Data Infrastructure”
www.eudat.eu
EUDAT is FAIR
Findable – assign persistent IDs, provide rich metadata, register in a searchable resource...
Accessible – retrievable by their ID using a standard protocol, metadata remain accessible even if data aren’t...
Interoperable – use formal, broadly applicable languages, use standard vocabularies, qualified references...
Reusable – rich, accurate metadata, clear licences, provenance, use of community standards...
www.force11.org/group/fairgroup/fairprinciples
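As a rough illustration of how the four FAIR principles translate into concrete checks on a metadata record, the sketch below reports which fields are missing per principle. The field names and their mapping to the principles are assumptions made for this example; this is not an official FAIR metric or an EUDAT schema.

```python
# Hypothetical mapping from each FAIR principle to metadata fields
# that a record would need in order to support it.
FAIR_FIELDS = {
    "Findable": ["pid", "title", "description"],
    "Accessible": ["access_url"],
    "Interoperable": ["format", "vocabulary"],
    "Reusable": ["licence", "provenance"],
}


def fair_gaps(record: dict) -> dict:
    """Return, per principle, the assumed fields that are missing or empty."""
    return {
        principle: [name for name in names if not record.get(name)]
        for principle, names in FAIR_FIELDS.items()
    }


# Made-up example record; the identifier and URL are placeholders.
record = {
    "pid": "21.T12345/example",
    "title": "Example dataset",
    "description": "Half-hourly measurements",
    "access_url": "https://example.org/data",
    "format": "text/csv",
}
gaps = fair_gaps(record)
print(gaps["Interoperable"])  # ['vocabulary']
print(gaps["Reusable"])       # ['licence', 'provenance']
```

A checklist like this only covers the presence of metadata; whether the metadata are rich and accurate, as the Reusable principle asks, still needs human review.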
What is the EUDAT Service offer?
EUDAT & Findable
B2FIND: multi-disciplinary metadata catalogue
– now: common metadata catalogue, harvesting across all CDI data, single point for data discoverability;
– in development: aim to improve with agreed basic metadata for all data objects.
B2HANDLE: policy-based prefix & PID management
– now: common PID mechanism across all CDI data;
– in development: aim to improve with agreed common schema and behaviour.
B2SHARE: research data repository
– now: full, tailored metadata support for data deposits.
B2SAFE: policy-driven data management
– in development: aim to introduce a common data model promoting metadata extraction and processing.
EUDAT & Accessible
B2STAGE: data staging service
B2SHARE: research data repository
B2SAFE: policy-driven data management
– now: EUDAT presents data through common Internet protocols and APIs, http and gridftp;
– in development: aim to improve with a single http API for all services and data.
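"Retrievable by their ID using a standard protocol" can be illustrated with persistent identifiers from the Handle System, which can be looked up over HTTP through the public proxy at hdl.handle.net. The sketch below builds the lookup URL and extracts the target location from the JSON reply. It assumes the PID is registered in the global Handle System; whether a particular EUDAT service exposes its PIDs this way is an assumption, and the handle used in the example is made up.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

# REST endpoint of the public Handle proxy.
HANDLE_PROXY = "https://hdl.handle.net/api/handles/"


def handle_api_url(pid: str) -> str:
    """Build the lookup URL for a handle of the form 'prefix/suffix'."""
    return HANDLE_PROXY + quote(pid, safe="/")


def first_url(response: dict) -> Optional[str]:
    """Return the first value of type URL from a handle record, if any."""
    for value in response.get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None


def resolve(pid: str, timeout: float = 10.0) -> Optional[str]:
    """Resolve a PID to its current location (performs a network request)."""
    with urlopen(handle_api_url(pid), timeout=timeout) as reply:
        return first_url(json.load(reply))


# Offline example with a hand-written reply in the proxy's JSON layout;
# the handle and target URL are placeholders.
sample_reply = {
    "responseCode": 1,
    "handle": "21.T12345/example",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string", "value": "https://example.org/data"}},
    ],
}
print(handle_api_url("21.T12345/example"))
print(first_url(sample_reply))  # https://example.org/data
```

Because the identifier is resolved at access time, the data can move between repositories without breaking references, as long as the handle record is kept up to date.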
EUDAT & Interoperable
B2HANDLE: policy-based prefix & PID management
– now: common PID mechanism across all CDI data;
– in development: aim to improve with agreed common schema and behaviour.
B2STAGE: data staging service
B2SHARE: research data repository
B2SAFE: policy-driven data management
– in development: single http API for all services and data (interoperability of data services, if not data!).
B2FIND: multi-disciplinary metadata catalogue
– in development: agreed basic metadata for all data objects (a degree of metadata interop).
EUDAT & Reusable
B2SHARE: research data repository
B2SAFE: policy-driven data management
– now: encourage use of CC BY v 3 as common open data licence; encourage open formats where we have any influence.
Acknowledgment
• Tommi Nyrönen, ELIXIR-Finland Head of Node
• Mikael Linden, CSC
• Timmo Vesala, INAR RI, Helsinki University
• Damien Lecarpentier, EUDAT Project Director
facebook.com/CSCfi
twitter.com/CSCfi
youtube.com/CSCfi
linkedin.com/company/csc---it-center-for-science
Images: CSC archive and Thinkstock
CSC – IT Center for Science Ltd
Per Öster
Director, Research Infrastructures
per.oster@csc.fi
