Reproducible Data Science with R
David Smith
R Community Lead, Microsoft
@revodavid
What is reproducibility?
“Two honest researchers would
get the same result”
– John Mount
• Transparent data sourcing and availability
• Fully automated analysis pipeline (code provided)
• Traceability from published results back to data
Why Reproducibility?
• Save time
• Better science
• More authoritative research
• Reduce risk of errors
• Facilitate collaboration
Accessing Data
• Data sources
– databases, sensor logs, spreadsheets, file download, APIs, …
– Remember, these are sources: don’t modify them!
• Snapshot data into local static files (text is good)
– You will likely include some ETL steps here
– Record a timestamp of when data was extracted: Sys.time()
– With big data, sample during development, scale up to finalize
– Document how this was all done, preferably with code (source file)!
• Import text files using R functions
– Recommended package: readr (avoids issues with locale, dates, etc.)
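The snapshot-then-import steps above can be sketched roughly as follows. This is a minimal illustration, not the speaker's own code: `fetch_source()`, the `data-raw` directory, and the file names are hypothetical, and base R's `read.csv()` with explicit `colClasses` stands in for readr so the sketch runs without extra packages.

```r
# Hypothetical stand-in for a real source (database query, API call, download, ...).
fetch_source <- function() {
  data.frame(
    id = 1:3,
    reading = c(2.5, 3.1, 4.8),
    measured_on = as.Date(c("2017-04-01", "2017-04-02", "2017-04-03"))
  )
}

# A temp directory is used so the sketch has no side effects;
# in a real project this would be a relative path such as "data-raw".
snapshot_dir <- file.path(tempdir(), "data-raw")
dir.create(snapshot_dir, showWarnings = FALSE, recursive = TRUE)

# Snapshot the source into a static text file; the source itself is never modified.
snapshot_file <- file.path(snapshot_dir, "readings.csv")
write.csv(fetch_source(), snapshot_file, row.names = FALSE)

# Record when the extract was taken, alongside the data.
extracted_at <- format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z")
writeLines(extracted_at, file.path(snapshot_dir, "readings.timestamp.txt"))

# Import from the snapshot with explicit column types, so the result does not
# depend on type guessing.
readings <- read.csv(
  snapshot_file,
  colClasses = c("integer", "numeric", "Date"),
  stringsAsFactors = FALSE
)
str(readings)
```

Every later step reads from the snapshot file rather than the live source, so rerunning the analysis does not silently pick up changed data.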
Analysis process
• Interactively explore data & develop analyses as usual
• Capture the entire process in an R script
– library(tidyverse) is helpful for cleaning, feature generation, etc.
– Re-usable components can be shared as (private) packages
• Generate artifacts using scripts
– Graphics (please, no JPGs!)
– Tables
– Documents
• Include timestamps in output: Sys.time()
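As a rough sketch of generating artifacts from a script, the example below writes one table and one graphic and stamps the graphic with the run time. The output directory and file names are hypothetical, the built-in `mtcars` data stands in for real project data, and base R is used in place of the tidyverse to keep the sketch dependency-free.

```r
# A temp directory is used here; in a real project this would be e.g. "output".
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# One timestamp per run, reused in every artifact.
run_stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z")

# Table artifact: mean miles per gallon by cylinder count.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
write.csv(mpg_by_cyl, file.path(out_dir, "mpg_by_cyl.csv"), row.names = FALSE)

# Graphic artifact: a vector PDF rather than a lossy JPG.
plot_file <- file.path(out_dir, "mpg_vs_wt.pdf")
pdf(plot_file, width = 6, height = 4)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy vs. weight")
# Stamp the figure so a reader can tell which run produced it.
mtext(paste("Generated", run_stamp), side = 1, line = 4, adj = 1, cex = 0.7)
dev.off()
```

Because nothing is edited by hand, rerunning the script regenerates both files exactly as before, apart from the timestamp.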
A reproducible analysis environment
• Operating system and R version
– For most purposes, not the biggest cause of issues
– But do document your R session info: sessionInfo()
– For production, consider VM / container
• A clean R environment
– Organize work into independent projects (directories)
– Use relative paths in scripts
– Avoid use of .Rprofile
– Set explicit random seeds
– Do not save R workspace
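A few of these habits can be illustrated in a short sketch. The seed value and file names below are arbitrary choices made for illustration only.

```r
# An explicit seed makes random draws repeatable across runs.
set.seed(20170411)
draw_a <- rnorm(5)
set.seed(20170411)
draw_b <- rnorm(5)
stopifnot(identical(draw_a, draw_b))

# Record the R version, platform, and loaded packages with the results.
# (A temp directory is used here; a project would keep this file with its output.)
info_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# Build paths relative to the project directory rather than hard-coding
# an absolute location that exists on only one machine.
input_path <- file.path("data", "readings.csv")
```

Starting R with `R --no-save --no-restore` (or the equivalent IDE setting) is one way to avoid saving and reloading the workspace between sessions.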
Managing a changing R package ecosystem
• One command to lock package versions to a specific date:
checkpoint("2017-04-11")
(For collaborators, same command downloads required package versions)
• Applies just to this project
– Global package upgrades won’t break reproducibility in other projects
• The checkpoint package is available on CRAN and is included with Microsoft R
Presenting results
• Eliminate manual processes (as far as possible)
– Annotations (graphs / tables)
– Cut-and-paste into documents
• Notebooks
– Combine code, output, and narrative
– Good for collaboration with other researchers
• Document Generation
– Best for automating reports
knitr / Rmarkdown
• Generate HTML, Word, or PDF reports
– Or books, blogs, slides, …
• Combine narrative and R code in a single document
• Human-readable, easy to edit
• Single-click update!
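A minimal sketch of what such a document might look like follows. The file name, title, and chunk label are hypothetical. The script writes a small R Markdown file and renders it only if the rmarkdown package and pandoc happen to be installed; the fence string is built with `strrep()` purely to avoid nesting code fences inside this example.

```r
report_dir <- file.path(tempdir(), "report")   # in a real project: the project folder
dir.create(report_dir, showWarnings = FALSE, recursive = TRUE)
rmd_file <- file.path(report_dir, "report.Rmd")

fence <- strrep("`", 3)  # three backticks: the chunk delimiter in R Markdown

# Narrative text and R code live together in one plain-text file.
writeLines(c(
  "---",
  "title: \"Fuel economy summary\"",
  "output: html_document",
  "---",
  "",
  "Report generated on `r format(Sys.time(), '%Y-%m-%d %H:%M')`.",
  "",
  paste0(fence, "{r mpg-summary}"),
  "summary(mtcars$mpg)",
  fence
), rmd_file)

# Render to HTML when the tooling is available; otherwise just keep the source.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, output_dir = report_dir, quiet = TRUE)
}
```

Re-rendering after the data or code changes regenerates the whole report, which is what makes the single-click update possible.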
Collaboration and sharing
• Just share the R project folder
• Publish on GitHub
– Happy Git and GitHub for the useR, Jenny Bryan: http://happygitwithr.com/
– Version retention and tracking
– Collaboration (code and comments)
Take-Aways
Reproducibility is Beneficial
• Saves time
• Produces better science
• More trusted research
• Reduced risk of errors
• Encourages collaboration
Reproducibility is Simple
• Document and automate processes with R scripts
• Read and clean data with tidyverse
• Use checkpoint to manage package versions
• Generate documents with knitr
• Share reproducible projects with GitHub