Data Management for Research
Aaron Collie, MSU Libraries
Lisa Schmidt, University Archives
Introductions
• Please tell us your name and department
• A brief description of your primary research area
• What do you consider to be your research data?
• Experience and/or comfort level with managing research data?
cc http://www.flickr.com/photos/quinnanya/
Data Management. Isn’t that… trivial?
• Not so much. Data is a primary output of research, and high-quality data is very expensive to produce. Data may be collected in nanoseconds, but it takes the expert application of research protocol and design to generate data.
CC-BY-SA-3.0 Rob Lavinsky
• Even more consequential, data is the input of a process that generates higher orders of understanding.
Diagram levels: Wisdom, Knowledge, Information, Data
Understanding is hierarchical! (Russell Ackoff)
This is the engine of the academic industry…
So, things can get a little messy.
The scientific method “is often misrepresented as a fixed sequence of steps,” rather than being seen for what it truly is, “a highly variable and creative process” (AAAS 2000:18).
Gauch, Hugh G. Scientific Method in Practice. New York: Cambridge University Press, 2010. Print. (Emphasis added)
The Research Depth Chart
Scientific Method
Research Design
Research Method
Research Tasks
(Axis labels: More Specific / More Generic)
Problem Identification
Study Concept
Literature Review
Environmental Scan
Funding & Proposal
Research Design
Research Methodology
Research Workflow
Hypothesis Formation
Design Validation
Research Activity
Data Management
Data Organization
Data Storage
Data Description
Data Sharing
Scholarly Communication
Report Findings
Publish
Peer Review
Upfront Decisions for Researchers
• How are the data described and organized?
• Who are the expected and potential audiences for the datasets?
• What publications or discoveries have resulted from the datasets?
• How should the data be made accessible?
• How might the data be used, reused, and repurposed?
Upfront Decisions for Researchers
• What is the expected lifespan of the data?
• Besides the researcher(s) on the project, who else should be given access to the data?
• Does the dataset include any sensitive information?
• Who owns or controls the research data?
• Should any restrictions be placed on the dataset?
• How are the data stored and preserved?
Agenda
• Introduction
• Background
  • The Impetus: NSF Data Management Plan Mandate
  • The Effect: Policy to Practice
  • The Response: Changing Data Landscape
• Fundamental Practices
  • File Organization
  • Data Documentation
  • Reliable Backup
  • Data Publishing, Sharing, & Reuse
  • Protecting Data & Responsible Reuse
• Data Lifecycle Resources
But why are we really here?
• Impetus: NSF has mandated that all grant applications submitted on or after January 18, 2011 must include a supplemental “Data Management Plan”
• Effect: The original NSF mandate has had a domino effect, and many funders now require or state guidelines for data management of grant-funded research
• Response: Data management has not traditionally received a full treatment in (many) graduate and doctoral curricula; intervention is necessary
Impetus: NSF Data Management Plan
• Policies for re-use, re-distribution, and creation of derivatives
• Plans for archiving data, samples, and other research outcomes, maintaining access
• Types of data, samples, physical collections, software generated
• Standards for data and metadata format and content
• Access and sharing policies, with stipulations for privacy, confidentiality, security, intellectual property, or other rights or requirements
Impetus: NSF Data Management Plan
• NSF will not evaluate any proposal missing a DMP
• PI may state that project will not generate data
• DMP is reviewed as part of intellectual merit or broader impacts of application, or both
• Costs to implement DMP may be included in proposal’s budget
• DMP may be up to two pages long
Effect: Funder Policies
• NASA “promotes the full and open sharing of all data”
• “requires that data…be submitted to and archived by designated national data centers.”
• “expects the timely release and sharing of final research data”
• “IMLS encourages sharing of research data.”
• “…should describe how the project team will manage and disseminate data generated by the project”
Effect: More is on the way
• Presidential Memorandum on Managing Government Records (August 24, 2012)
  • Managing Government Records Directive: All permanent electronic records in Federal agencies will be managed electronically to the fullest extent possible for eventual transfer and accessioning by NARA in an electronic format.
• White House policy memo (February 22, 2013)
  • Increasing Access to the Results of Federally Funded Scientific Research: Federal agencies with more than $100M in R&D expenditures must develop plans to make the published results of federally funded research freely available to the public within one year of publication.
Effect: Local Policy
• University Research Council Best Practices, Research Data: Management, Control, and Access
  • To assure that research data are appropriately recorded, archived for a reasonable period of time, and available for review under the appropriate circumstances.
    – Ownership = MSU
    – “Stewardship” = You
    – Period of Retention = 3 years
    – Transfer of Responsibility = Written Request
Response: Changing Data Landscape
• Data Management Competencies
• Standards & Best Practices
• Discipline-Specific Discourse
• Data sharing and open data
• Data sets as publications
• Data journals
• Citations for data (e.g., used in secondary analysis)
• Data as supplementary materials to traditional articles
• Data repositories and archives
Data Sharing Impacts
• Reinforces open scientific inquiry
• Encourages diversity of analysis and opinion
• Promotes new research, testing of new or alternative hypotheses and methods of analysis
• Supports studies on data collection methods and measurement
cc http://www.flickr.com/photos/pinchof_10/
Data Sharing Impacts
• Facilitates education of new researchers
• Enables exploration of topics not envisioned by initial investigators
• Permits creation of new datasets by combining data from multiple sources
File Organization Practices: Overview
1. Design a file plan for your research project
2. Use file naming conventions that work for your project
3. Choose file formats to maximize usefulness
“When I was a freshmen I named my assignments Paper Paperr Paperrr Paperrrr” -Undergrad
Design a File Plan
• File structure is the framework
• Classification system makes it easier to locate folders/files
• Benefits:
  • Simple organization intuitive to team members and colleagues
  • Reduces duplicate copies in personal drives and e-mail attachments
Design a File Plan
• Choose a sortable directory hierarchy
• Example 1: Investigator, Process, Date
    Collie
      TEI_Encoding
        20110117
• Example 2: Instrument, Date, Sample
    Usability Survey
      20120430
        Sample 1
Design a File Plan
Example documentation of Directory Hierarchy:
/[Project]/[Grant Number]/[Event]/[Investigator/Date]
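A documented hierarchy like the one above can also be generated by a small script so that every team member builds paths the same way. The following is only a sketch: the function name, the sample values, and the choice to join investigator and date with an underscore at the last level are assumptions, since the template itself only writes [Investigator/Date].

```python
from pathlib import PurePosixPath


def project_dir(project, grant_number, event, investigator, date_yyyymmdd):
    """Build a directory path following /[Project]/[Grant Number]/[Event]/[Investigator/Date]."""
    # Assumption: the last level combines investigator and date with an underscore.
    leaf = f"{investigator}_{date_yyyymmdd}"
    return PurePosixPath("/", project, grant_number, event, leaf)


# Hypothetical values, for illustration only.
path = project_dir("KrillSurvey", "GRANT-0000", "Cruise01", "sharpeW", "20110117")
print(path)
```

Because the date is written as YYYYMMDD, sibling directories for the same investigator sort chronologically in a plain directory listing.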
Use File Naming Conventions
• Why file naming conventions?
  • Enable better access/retrieval of files
  • Create logical sequences for file sorting
  • More easily identify what you’re searching for
Use File Naming Conventions
• Meaningful but short (255-character limit)
• Use alphanumeric characters
  • Example: abc123
• Capital letters or underscores differentiate between words
• Surname first followed by initials of first name
Use File Naming Conventions
• Year-month-day format for dates, with or without hyphens
  • Example 1: 2006-03-13
  • Example 2: 20060313
• Decide on a simple versioning method
  • Example: file_v001
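The date and versioning conventions can be applied mechanically rather than typed by hand. This sketch assumes a zero-padded `_vNNN` suffix at the end of the name, as in the file_v001 example; the helper names `date_stamp` and `next_version` are hypothetical.

```python
import re
from datetime import date


def date_stamp(d, hyphens=False):
    """Format a date as YYYY-MM-DD or YYYYMMDD so names sort chronologically."""
    return d.strftime("%Y-%m-%d" if hyphens else "%Y%m%d")


def next_version(name):
    """Increment a trailing _vNNN suffix, or append _v001 if there is none."""
    match = re.search(r"_v(\d+)$", name)
    if match is None:
        return f"{name}_v001"
    width = len(match.group(1))  # keep the existing zero padding
    number = int(match.group(1)) + 1
    return f"{name[:match.start()]}_v{number:0{width}d}"


print(date_stamp(date(2006, 3, 13)))        # 20060313
print(date_stamp(date(2006, 3, 13), True))  # 2006-03-13
print(next_version("file_v001"))            # file_v002
```

Zero padding matters for sorting: file_v010 sorts after file_v009, whereas file_v10 would sort before file_v9.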
Use File Naming Conventions
• To create consistent file names, specify a template such as:
  [investigator]_[descriptor]_[YYYYMMDD].[ext]

This                                                 Not This
sharpeW_krillMicrograph_backscatter3_20110117.tif    KrillData2011.tif
borgesJ_collocation_20080414.xml                     Borges_Textbase.xml
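A template is easier to follow when a script can build and check names against it. The sketch below is one possible reading of the template: it assumes alphanumeric parts separated by underscores and checks only that the date is eight digits, not that it is a real calendar date. The names `NAME_PATTERN`, `build_name`, and `follows_template` are hypothetical.

```python
import re

# [investigator]_[descriptor]_[YYYYMMDD].[ext]; the descriptor may contain underscores.
NAME_PATTERN = re.compile(
    r"^(?P<investigator>[A-Za-z0-9]+)_"
    r"(?P<descriptor>[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*)_"
    r"(?P<date>\d{8})\.(?P<ext>[A-Za-z0-9]+)$"
)


def follows_template(filename):
    """Return True if the file name matches the template."""
    return NAME_PATTERN.match(filename) is not None


def build_name(investigator, descriptor, yyyymmdd, ext):
    """Assemble a file name and reject it if it does not fit the template."""
    name = f"{investigator}_{descriptor}_{yyyymmdd}.{ext}"
    if not follows_template(name):
        raise ValueError(f"name does not follow the template: {name}")
    return name


for candidate in ["borgesJ_collocation_20080414.xml", "Borges_Textbase.xml"]:
    print(candidate, follows_template(candidate))
```

Run against the examples above, the "This" names pass and the "Not This" names fail, because the latter carry no eight-digit date field.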
Choose Appropriate File Formats
• Non-proprietary
• Open, documented standard
• Common usage by research community
• Standard representation (ASCII, Unicode)
• Unencrypted
• Uncompressed
Choose Appropriate File Formats
Format Genre    Optimal Standards
TEXT            .txt; .odt; .xml; .html
AUDIO           .flac; .wav
VIDEO           .mp2/.mp4; .mkv
IMAGE           .tif; .png; .svg; .jpg
DATA            .sql; .csv
Documentation Practices: Overview
• Even researchers require proper documentation to decipher or reuse their datasets
• Documentation = accessible, intelligible datasets
Documentation Practices: Overview
1. At minimum create a README file that you can use to document your project
2. Utilize standards for describing data including Metadata Standards
3. If applicable, use in-line code commentary to explain code
(cc) Will Scullin
Create a README file
• At minimum, store documentation in a readme.txt file or equivalent, alongside the data
Create a README file
• Significant documentation about dataset
  • What data consists of
  • How it was collected
  • Restrictions to distribution or use
  • Other descriptive information
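The README items listed above can be turned into a fill-in template so that no item is forgotten. This is a minimal sketch; the function name `render_readme`, the layout, and the sample text are assumptions rather than a prescribed README format.

```python
def render_readme(title, contents, collection, restrictions, other=""):
    """Return README text covering the four items listed on the slide."""
    sections = [
        ("What the data consists of", contents),
        ("How it was collected", collection),
        ("Restrictions to distribution or use", restrictions),
        ("Other descriptive information", other or "None recorded."),
    ]
    lines = [title, "=" * len(title), ""]
    for heading, body in sections:
        lines += [heading + ":", "  " + body, ""]
    return "\n".join(lines)


# Hypothetical project details, for illustration only.
text = render_readme(
    "Usability Survey",
    "Survey responses in CSV format, one row per participant.",
    "Online questionnaire administered on 2012-04-30.",
    "Contains no identifying information; cite the project when reusing.",
)
print(text)
```

The returned text can then be saved as readme.txt in the same directory as the data, for example with `pathlib.Path(...).write_text(text, encoding="utf-8")`.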
Use Metadata Standards
• “Data about data”
• Standardized way of describing data
• Explains who, what, where, when of data creation and methods of use
• Data more easily found
• Data more easily compared to other data sets
Use Metadata Standards
Basic project metadata:
• Title                 • Language           • File Formats
• Creator               • Dates              • File Structure
• Identifier            • Location           • Variable List
• Subject               • Methodology        • Code Lists
• Funders               • Data Processing    • Versions
• Rights                • Sources            • Checksums
• Access Information    • List of File Names
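Two of the items above, the list of file names and the checksums, can be produced together by a script. The sketch below uses SHA-256 from Python's standard `hashlib` module; the choice of SHA-256 and the function names are assumptions, as the slides do not name a checksum algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its checksum."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Calling `build_manifest` on a project directory and storing the result with the dataset gives later users a way to confirm that files have not changed or been corrupted: recompute the manifest and compare.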
Use Metadata Standards
• Dublin Core: Commonly-used descriptive metadata format facilitates dataset discovery across the Web.
• Data Documentation Initiative (DDI): Defines metadata content, presentation, transport, and preservation for the social and behavioral sciences.
• ISO 19115:2003: Describes geographic data such as maps and charts.
• More examples: http://www.lib.msu.edu/about/diginfo/collect.jsp
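As an illustration of what a Dublin Core description looks like in practice, the sketch below writes a few Dublin Core elements as XML using Python's standard library. The element names (title, creator, date, format, rights) and the namespace URI belong to the Dublin Core element set; the `metadata` wrapper element, the function name, and all sample values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names and values as XML."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
record = dublin_core_record({
    "title": "Krill micrograph, backscatter series",
    "creator": "Sharpe, W.",
    "date": "2011-01-17",
    "format": "image/tiff",
    "rights": "CC BY 4.0",
})
print(record)
```

A repository will usually supply its own form or schema for these fields, so a hand-built record like this is mainly useful for keeping descriptions alongside the data until deposit.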
Use In-Line Code Commentary
• If applicable, in-line code commentary helps explain code
Example of R code commentary:
# Cumulative normal distribution function (CDF) evaluated at -1.96, 0, and 1.96
pnorm(c(-1.96, 0, 1.96))
Backup Practices: Overview
• Without a storage and backup plan, data are at significant risk of loss from:
  • Hardware / network failures
  • Bit rot
  • Human error
  • Reliance on a single commercial-grade hard drive
• Effective data storage plan provides for:
  • Primary authoritative copy
  • Secondary local backup
  • Tertiary remote backup
Backup Practices: Overview
1. Avoid single points of failure
2. Ensure data redundancy & replication
3. Understand the common types of storage
Avoid Single Points of Failure
A single point of failure occurs when it would only take one event to destroy all data on a device
• Use managed networked storage when possible
• Move data off of portable media
• Never rely on one copy of data
• Do not rely on CD or DVD copies to be readable
• Be wary of software lifespans
Ensure Data Redundancy
Backup Do’s:
• Make 3 copies
  • E.g. original + external/local + external/remote
  • E.g. original + 2 formats on 2 drives in 2 locations
• Geographically distribute and secure
  • Local vs. remote, depending on needed recovery time
• Personal computers, external hard drives, and departmental or university servers may be used
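The "make 3 copies" rule can be scripted so that each copy is verified as it is made. This sketch copies one file into a local and a remote backup directory and compares checksums; the function names are hypothetical, and the backup locations are whatever mounted paths a project actually has available.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file (reads the whole file into memory; fine for small files)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def replicate(original, backup_dirs):
    """Copy the original into each backup directory and verify every copy."""
    original = Path(original)
    expected = file_digest(original)
    copies = []
    for directory in backup_dirs:
        target_dir = Path(directory)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / original.name
        shutil.copy2(original, target)  # copy2 also preserves timestamps
        if file_digest(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

With two backup directories, for example an external drive and a mounted university share, the original plus the two returned copies make up the three copies recommended above.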
Ensure Data Redundancy
Backup Don’ts:
• Do not rely on one copy
• Do not use CDs and DVDs
• Do not rely on ANGEL or Desire2Learn
(cc) George Ornbo
Ensure Data Redundancy
Backup Maybe:
• Cloud storage
  • Amazon S3
  • Google
  • MS Azure
  • DuraCloud
  • Rackspace
  • Glacier
Note that many enterprise cloud storage services charge for transferring data in and out ($$$)
Understand Common Types of Storage
• Optical Media
• Portable Flash Media
• Commercial Hard Drives
• Commercial NAS
• Cloud Storage
• Enterprise Network Storage
• Trusted Archival Storage
Understand Common Types of Storage
• Features of storage types:
• Portable data transfers
• Short-term storage
• Project term storage
• Networked data transfer
• Long-term storage
• Reliable backup option
Understand Common Types of Storage
• Enterprise storage at MSU
• AFS Storage (Free up to 1 GB)
• Fee based
• Individual Storage
• Mid-Tier Storage
• Free up to 1TB
• HPCC Home Directory
• HPCC Research Directory
Data Publishing, Sharing, Reuse: Overview
1. Prepare data in suitable format, for a potentially high return on investment
2. Publish data in several data publication venues to more broadly share results of research
Research datasets are becoming first-class scholarly contributions on par with peer-reviewed journal articles
Sharing & Publishing Data
• Data preparation for sharing and publication is a time-intensive process
• Potential positive outcomes:
  • Increased research impact and citations
  • Enable additional scientific inquiry
  • Opportunities for co-authorship and collaboration
  • Enhance your grant proposal’s competitiveness
Data Publication Venues
• Multiple ways to publish research data
  • Faculty or project website
  • Journal supplementary materials
  • Disciplinary data repository (data archive)
• Varying levels of support for indexing, access controls, and long-term curation
Data Publication Venues
• Disciplinary Data Repository
• Securely share data, ensure long-term access
• High visibility
• Often offer persistent citations
• Availability varies across domains
• Databib.org directory
Protecting Data & Responsible Reuse
1. Consider how to protect data and intellectual property rights while encouraging reuse
2. Keep in mind ethical concerns when sharing data
(cc) Will Scullin
Intellectual Property
• IP refers to exclusive rights of creators of works
• Individual data cannot be protected by US copyright
• Organization of data such as a database, creative work produced from data, and research instruments used may be protected
Intellectual Property
• Principal investigator’s institution holds IP rights
• Provide clearly stated license for producing derivatives, reusing, and redistributing datasets
  • License under Creative Commons
  • State if any restrictions or embargos on use
• Provide example of how work should be cited to encourage proper attribution on reuse
• Document any IP / copyright issues
Ethics & Data Sharing
• Keep in mind the following ethical concerns when sharing your data:
  • Privacy
  • Confidentiality
  • Security and integrity of the data
• For data involving human subjects, obtain written permission or consent stating how the data may be reused
Best Practices = High Impact Data
• File organization ensures easier access and retrieval of data
• Documentation makes datasets accessible and intelligible to users
• Storage and backup safeguards data
• Data publishing and sharing encourages the most widespread reuse of data
• Data protection ensures responsible reuse
Data Lifecycle Resources: http://www.lib.msu.edu/rdmg
Contact
Lisa M. Schmidt
Electronic Records Archivist
University Archives & Historical Collections
lschmidt@ais.msu.edu
Aaron Collie
Digital Curation Librarian
MSU Libraries
collie@msu.edu

Data Management for Research (New Faculty Orientation)

  • 1.
    Data Management for Research AaronCollie, MSU Libraries Lisa Schmidt, University Archives
  • 2.
    Introductions  Please tellus your name and department  A brief description of your primary research area  What do you consider to be your research data  Experience and/or comfort level with managing research data? cc http://www.flickr.com/photos/quinnanya/
  • 3.
    Data Management. Isn’tthat… trivial?  Not so much. Data is a primary output of research; it is very expensive to produce high quality data. Data may be collected in nanoseconds, but it takes the expert application of research protocol and design to generate data. CC-BY-SA-3.0 Rob Lavinsky CC-BY-SA-3.0 Rob
  • 4.
     Even moreconsequential, data is the input of a process that generates higher orders of understanding. Wisdom Knowledge Information Data Understanding is hierarchical! Russell Ackoff
  • 5.
    This is theengine of the academic industry…
  • 7.
    So, things canget a little messy.
  • 8.
    The scientific method“is often misrepresented as a fixed sequence of steps,” rather than being seen for what it truly is, “a highly variable and creative process” (AAAS 2000:18). Gauch, Hugh G. Scientific Method in Practice. New York: Cambridge University Press, 2010. Print. (Emphasis added)
  • 10.
    The Research DepthChart Scientific Method Research Design Research Method Research Tasks MoreSpecificMoreGeneric
  • 11.
  • 12.
  • 13.
    Upfront Decisions forResearchers  How are the data described and organized?  Who are the expected and potential audiences for the datasets?  What publications or discoveries have resulted from the datasets?  How should the data be made accessible?  How might the data be used, reused, and repurposed?
  • 14.
    Upfront Decisions forResearchers  What is the expected lifespan of the data?  Besides the researcher(s) on the project, who else should be given access to the data?  Does the dataset include any sensitive information?  Who owns or controls the research data?  Should any restrictions be placed on the dataset?  How are the data stored and preserved?
  • 15.
    • Introduction • Background •The Impetus: NSF Data Management Plan Mandate • The Effect: Policy to Practice • The Response: Changing Data Landscape • Fundamentals Practices • File Organization • Data Documentation • Reliable Backup • Data Publishing, Sharing, & Reuse • Protecting Data & Responsible Reuse • Data Lifecycle Resources Agenda
  • 16.
    But why arewe really here?  Impetus: NSF has mandated that all grant applications submitted after January 18th, 2011 must include a supplemental “Data Management Plan”  Effect: The original NSF mandate has had a domino effect, and many funders now require or state guidelines for data management of grant funded research  Response: Data management has not traditionally received a full treatment in (many) graduate and doctoral curricula; intervention is necessary
  • 17.
    Impetus: NSF DataManagement Plan  Policies for re-use, re-distribution, and creation of derivatives  Plans for archiving data, samples, and other research outcomes, maintaining access  Types of data, samples, physical collections, software generated  Standards for data and metadata format and content  Access and sharing policies, with stipulations for privacy, confidentiality, security, intellectual property, or other rights or requirements
  • 18.
    Impetus: NSF DataManagement Plan  NSF will not evaluate any proposal missing a DMP  PI may state that project will not generate data  DMP is reviewed as part of intellectual merit or broader impacts of application, or both  Costs to implement DMP may be included in proposal’s budget  May be up to two pages long
  • 19.
    Effect: Funder Policies NASA“promotes the full and open sharing of all data” “requires that data…be submitted to and archived by designated national data centers.” “expects the timely release and sharing of final research data" "IMLS encourages sharing of research data." “…should describe how the project team will manage and disseminate data generated by the project”
  • 20.
    Effect: More ison the way  Presidential Memorandum on Managing Government Records (August 24, 2012) • Managing Government Records Directive: All permanent electronic records in Federal agencies will be managed electronically to the fullest extent possible for eventual transfer and accessioning by NARA in an electronic format.  White House policy memo (February 22, 2013) • Increasing Access to the Results of Federally Funded Scientific Research: Federal agencies with more than $100M in R&D expenditures must develop plans to make the published results of federally funded research freely available to the public within one year of publication.
  • 21.
    Effect: Local Policy University Research Council Best Practices Research Data: Management, Control, and Access • To assure that research data are appropriately recorded, archived for a reasonable period of time, and available for review under the appropriate circumstances. – Ownership = MSU – “Stewardship” = You – Period of Retention = 3 years – Transfer of Responsibility = Written Request
  • 22.
    Response: Changing DataLandscape  Data Management Competencies  Standards & Best Practices  Discipline Specific Discourse  Data sharing and open data  Data sets as publications  Data journals  Citations for data (e.g., used in secondary analysis)  Data as supplementary materials to traditional articles  Data repositories and archives
  • 23.
    Data Sharing Impacts Reinforces open scientific inquiry  Encourages diversity of analysis and opinion  Promotes new research, testing of new or alternative hypotheses and methods of analysis  Supports studies on data collection methods and measurement Cc http://www.flickr.com/photos/pinchof_10/
  • 24.
    Data Sharing Impacts Facilitates education of new researchers  Enables exploration of topics not envisioned by initial investigators  Permits creation of new datasets by combining data from multiple sources
  • 25.
    • Introduction • Background •The Impetus: NSF Data Management Plan Mandate • The Effect: Policy to Practice • The Response: Changing Data Landscape • Fundamentals Practices • File Organization • Data Documentation • Reliable Backup • Data Publishing, Sharing, & Reuse • Protecting Data & Responsible Reuse • Data Lifecycle Resources Agenda
  • 26.
    File Organization Practices:Overview 1. Design a file plan for your research project 2. Use file naming conventions that work for your project 3. Choose file formats to maximize usefulness “When I was a freshmen I named my assignments Paper Paperr Paperrr Paperrrr” -Undergrad
  • 27.
    Design a FilePlan  File structure is the framework  Classification system makes it easier to locate folders/files  Benefits:  Simple organization intuitive to team members and colleagues  Reduces duplicate copies in personal drives and e- mail attachments
  • 28.
    Design a FilePlan  Choose a sortable directory hierarchy  Example 1: Investigator, Process, Date Collie TEI_Encoding 20110117  Example 2: Instrument, Date, Sample Usability Survey 20120430 Sample 1
  • 29.
    Design a FilePlan Example documentation of Directory Hierarchy: /[Project]/[Grant Number]/[Event]/[Investigator/Date]
  • 30.
    Use File NamingConventions  Why file naming conventions?  Enable better access/retrieval of files  Create logical sequences for file sorting  More easily identify what you’re searching for
  • 31.
     Meaningful butshort—255 character limit  Use alphanumeric characters  Example: abc123  Capital letters or underscores differentiate between words  Surname first followed by initials of first name Use File Naming Conventions
  • 32.
     Year-month-day formatfor dates, with or without hyphens  Example 1: 2006-03-13  Example 2: 20060313  Decide on a simple versioning method  Example: file_v001 Use File Naming Conventions
  • 33.
     To createconsistent file names, specify a template such as: [investigator]_[descriptor]_[YYYYMMDD].[ext] Use File Naming Conventions This Not This sharpeW_krillMicrograph_backscatter3_20110117.tif KrillData2011.tif This Not This borgesJ_collocation_20080414.xml Borges_Textbase.xml
  • 34.
    Choose Appropriate FileFormats • Non-proprietary • Open, documented standard • Common usage by research community • Standard representation (ASCII, Unicode) • Unencrypted • Uncompressed
  • 35.
    Choose Appropriate FileFormats Format Genre Optimal Standards TEXT .txt; .odt; .xml; .html AUDIO .flac; .wav, VIDEO .mp2/.mp4; .mkv IMAGE .tif; .png; .svg; .jpg DATA .sql; .csv
  • 36.
    Documentation Practices: Overview Even researchers require proper documentation to decipher or reuse their datasets  Documentation = accessible, intelligible datasets
  • 37.
    Documentation Practices: Overview 1.At minimum create a README file that you can use to document your project 2. Utilize standards for describing data including Metadata Standards 3. If applicable, use in-line code commentary to explain code (cc) Will Scullin
  • 38.
    Create a READMEfile  At minimum, store documentation in readme.txt file or equivalent, with data
  • 39.
    Create a READMEfile  Significant documentation about dataset  What data consists of  How it was collected  Restrictions to distribution or use  Other descriptive information
  • 40.
     “Data aboutdata”  Standardized way of describing data  Explains who, what, where, when of data creation and methods of use  Data more easily found  Data more easily compared to other data sets Use Metadata Standards
  • 41.
    Use Metadata Standards Basicproject metadata: • Title • Language • File Formats • Creator • Dates • File Structure • Identifier • Location • Variable List • Subject • Methodology • Code Lists • Funders • Data Processing • Versions • Rights • Sources • Checksums • Access Information • List of File Names
  • 42.
    Use Metadata Standards Dublin Core: Commonly-used descriptive metadata format facilitates dataset discovery across the Web.  Data Documentation Initiative (DDI): Defines metadata content, presentation, transport, and preservation for the social and behavioral sciences.  ISO 19115:2003: Describes geographic data such as maps and charts.  More examples:http://www.lib.msu.edu/about/diginfo/coll ect.jsp
  • 43.
    Use In-Line CodeCommentary Example of R code commentary # Cumulative normal density pnorm(c(-1.96,0,1.96))  If applicable, in-line code commentary helps explain code
  • 44.
    Backup Practices: Overview Data at significant risk of loss without storage and backup plan, including:  Hardware / network failures  Bit rot  Human error  Singular commercial grade hard drives  Effective data storage plan provides for:  Primary authoritative copy  Secondary local backup  Tertiary remote backup
  • 45.
    Backup Practices: Overview 1.Avoid single points of failure 2. Ensure data redundancy & replication 3. Understand the common types of storage
  • 46.
    Avoid Single Pointsof Failure A single point of failure occurs when it would only take one event to destroy all data on a device  Use managed networked storage when possible  Move data off of portable media  Never rely on one copy of data  Do not rely on CD or DVD copies to be readable  Be wary of software lifespans
  • 47.
    Ensure Data Redundancy BackupDo’s:  Make 3 copies  E.g. original + external/local + external/remote  E.g. original + 2 formats on 2 drives in 2 locations  Geographically distribute and secure  Local vs. remote, depending on needed recovery time  Personal computer, external hard drives, departmental, or university servers may be used
  • 48.
    Ensure Data Redundancy BackupDon’ts:  Do not rely on one copy  Do not use CDs and DVDs  Do not rely on ANGEL or Desire2Learn (cc) George Ornbo
  • 49.
    Ensure Data Redundancy BackupMaybe:  Cloud storage  Amazon s3  Google  MS Azure  DuraCloud  Rackspace  Glacier Note that many enterprise cloud storage services include a charge for in/out of data transfers $$$
  • 50.
    Understand Common Typesof Storage • Optical Media • Portable Flash Media • Commercial Hard Drives • Commercial NAS • Cloud Storage • Enterprise Network Storage • Trusted Archival Storage
  • 51.
    Understand Common Typesof Storage • Features of storage types: • Portable data transfers • Short-term storage • Project term storage • Networked data transfer • Long-term storage • Reliable backup option
  • 52.
    Understand Common Typesof Storage • Enterprise storage at MSU • AFS Storage (Free up to 1 GB) • Fee based • Individual Storage • Mid-Tier Storage • Free up to 1TB • HPCC Home Directory • HPCC Research Directory
  • 53.
    Data Publishing, Sharing,Reuse: Overview 1. Prepare data in suitable format, for a potentially high return on investment 2. Publish data in several data publication venues to more broadly share results of research Research datasets becoming first-class scholarly contributions on par with peer- reviewed journal articles
  • 54.
    Sharing & PublishingData • Data preparation for sharing and publication is a time-intensive process • Potential positive outcomes: • Increased research impact and citations • Enable additional scientific inquiry • Opportunities for co-authorship and collaboration • Enhance your grant proposal’s competitiveness
  • 55.
    Data Publication Venues •Multiple ways to publish research data • Faculty or project website • Journal supplementary materials • Disciplinary data repository (data archive) • Varying levels of support for indexing, access controls, and long-term curation
  • 56.
    Data Publication Venues •Disciplinary Data Repository • Securely share data, ensure long-term access • High visibility • Often offer persistent citations • Availability varies across domains • Databib.org directory
  • 57.
    Protecting Data &Responsible Reuse 1. Consider how to protect data and intellectual property rights while encouraging reuse 2. Keep in mind ethical concerns when sharing data (cc) Will Scullin
  • 58.
    Intellectual Property • IPrefers to exclusive rights of creators of works • Individual data cannot be protected by US copyright • Organization of data such as database, creative work produced by data, and research instruments used may be protected
  • 59.
    Intellectual Property • Principalinvestigator’s institution holds IP rights • Provide clearly stated license for producing derivatives, reusing, and redistributing datasets • License under Creative Commons • State if any restrictions or embargos on use • Provide example of how work should be cited to encourage proper attribution on reuse • Document any IP / copyright issues
  • 60.
    Ethics & DataSharing • Keep in mind the following ethical concerns when sharing your data: • Privacy • Confidentiality • Security and integrity of the data • For data involving human subjects, obtain written permission or consent stating how the data may be reused
  • 61.
    Best Practices =High Impact Data • File organization ensures easier access and retrieval of data • Documentation makes datasets accessible and intelligible to users • Storage and backup safeguards data • Data publishing and sharing encourages the most widespread reuse of data • Data protection ensures responsible reuse
  • 62.
    • Introduction • Background •The Impetus: NSF Data Management Plan Mandate • The Effect: Policy to Practice • The Response: Changing Data Landscape • Fundamentals Practices • File Organization • Data Documentation • Reliable Backup • Data Publishing, Sharing, & Reuse • Protecting Data & Responsible Reuse • Data Lifecycle Resources Agenda
  • 63.
  • 64.
    Contact Lisa M. Schmidt Electronic Records Archivist University Archives & Historical Collections [email protected] Aaron Collie Digital Curation Librarian MSU Libraries [email protected]

Editor's Notes

  • #3 Show of hands – how many here from the bench sciences, social sciences, humanities, medicine?
  • #4 Data management is about more than just the lost back-pack. It is about expert application. Expert application in any industry is expensive.
  • #5 In the academic industry data is the input to our final product. It takes years of training and experience to succeed in this field.
  • #6 Research is a process, it is scientific, and we use an overarching model to describe the process at a high level. But this is a conceptual model, not a process model. It is also a pretty sterile model, and we know that because it is not prescriptive to all academic disciplines.
  • #7 In practice, research is a complicated process. It is a creative process as well as a scientific process.
  • #8 Research is hard, managing research is boring. So we want tips that make it easier.
  • #9 This has been noticed.
  • #10 You might think of the scientific method as a bit of an iceberg model. At the tip of the iceberg are these general activities, but research isn’t really conducted at this high of a level.
  • #11 Research is a thing that happens at many levels simultaneously. The more experience you gain with research, the more of the depth chart you develop expertise in.
  • #12 Data management is a subprocess of research. It is part of a holistic research method that includes a ton of other functions like funding, literature reviews, workflows and publication.
  • #13 Today we are going to focus on just one of these areas: data management.
  • #17 HANDOUT: DMP (blue)
  • #18 HANDOUT: DMP examples (white)
  • #19 NSF’s data management plan requirement. May be up to two pages long. PI may state that project will not generate data or samples. DMP is reviewed as part of intellectual merit or broader impacts of application, or both.
  • #20 National Oceanic and Atmospheric Administration (NOAA). IMLS encourages sharing of research data. Applications that develop digital products must fill out an additional form with ten questions focused on “Developing Data Management Plans for Research Projects. The federal government has the right to obtain, reproduce, publish or otherwise use the data first produced under an award and authorize others to do so for government purposes.” Ex: Digging Into Data
  • #22 (OMB Circular A-110, Sec. 53; 42 CFR, Part 50, Subpart A)
  • #23 Replication, transparency, re-use, mashups, repurposing, extending grant dollars and enabling more research…
  • #27 Electronic documents maintained together in one place, easily accessible to project staff. Reduces duplicate copies in personal drives and email attachments. (Hierarchical/taxonomical/temporal)
  • #28 Benefits include: Electronic documents maintained together in one place and easily accessible to project staff. Data backed up and recoverable in the event of system failure. Promote culture of sharing information as an institutional resource, rather than individual ownership. Reduce duplicate copies in personal drives and email attachments.
  • #29 Benefits include: Electronic documents maintained together in one place and easily accessible to project staff. Data backed up and recoverable in the event of system failure. Promote culture of sharing information as an institutional resource, rather than individual ownership. Reduce duplicate copies in personal drives and email attachments.
  • #30 You will know how to name future folders as your project grows.
  • #32 Good practices
  • #35 Good choices include… Consider later lifecycle activities. Flexible. What format used for analysis, preservation, etc.
  • #36 Consider later lifecycle activities. Flexible. What format used for analysis, preservation, etc.
  • #37 Starting point
  • #39 Starting point
  • #40 Starting point
  • #41 Descriptive documentation that accompanies a dataset
  • #42 Better project transitions
  • #46 Mention new Backup Media Storage service offered by the University Archives.
  • #47 One event might be a dropped hard drive. Good practices. Be wary of software lifespans, such as with course management software like ANGEL or Desire2Learn.
  • #48 Mention new Backup Media Storage service offered by the University Archives.
  • #49 Mention new Backup Media Storage service offered by the University Archives.
  • #50 Mention new Backup Media Storage service offered by the University Archives. ANGEL, Desire2Learn, and Google Apps might be considered Cloud offerings from MSU. Good for collaboration and the short term; don’t use for long-term storage.
  • #52 In booklet. For example…
  • #53 In booklet. For example…
  • #55 In booklet. For example…
  • #56 In booklet. For example…
  • #57 In booklet. For example…
  • #60 Principal investigator’s institution holds IP rights-- usually
  • #62 File organization ensures easier access and retrieval of data during and after project. Documentation makes datasets accessible and intelligible to users. Storage and backup safeguards data against technical failure, human error, and natural catastrophe. Data publishing and sharing encourages the most widespread reuse of data. Data protection ensures responsible reuse in light of intellectual property and ethical concerns. Increase impact of data and promote new research opportunities.
  • #65 A Plus / Delta exercise focusing on extant infrastructure and services. Weave known MSU resources. Discussion starters: Describe your interaction with dept, college, university, external bodies? What makes managing research data difficult? What services/tools do you need/want? Advice Website; Database designers; Targeted seminar series; Data storage and curation options.