Why R?
Jeffrey Stanton
Syracuse University
What is R?
• R is a statistics, data management, and
graphics platform
• R is open source, maintained and developed
by a community of developers.
• The R code repository, as well as compiled
binaries (ready-to-install software), are available
at: http://cran.r-project.org
• R comprises a core program plus 1000s of
freely available add-in packages.
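As a minimal sketch of how the core program and add-in packages fit together: packages that ship with R load directly, while others are fetched from CRAN first. The helper name `ensure_package` and the package name `ggplot2` are illustrative assumptions, not taken from the slides.

```r
# Packages that ship with the core R distribution load without any installation.
library(stats)

# List the packages currently installed on this machine.
installed <- rownames(installed.packages())

# Hypothetical helper: install a package from CRAN only if it is missing.
ensure_package <- function(pkg, repos = "https://cran.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Example call, commented out because it needs a network connection:
# ensure_package("ggplot2")
```

Checking before installing keeps the script safe to rerun on a machine that already has the package.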
CRAN
So Why or Why Not R?
• Most popular statistics software (other than R)
and some of their audiences:
– SPSS: Social Scientists
– Stata: Social Scientists
– Mathematica/Matlab: Engineers, mathematicians,
computer scientists, and physicists
– Python/NumPy: Computer scientists, web developers
– SAS: Data-intensive industries (e.g., financial services)
– Excel: All types of organizations

• According to the popularity analysis linked below, R is
used by a larger number of analysts than any of these
http://r4stats.com/articles/popularity/
But. . .
• Statistics users like point and click
• R is command line oriented, although there are GUIs that
can be loaded as add-on packages
• R-Studio is an Integrated Development
Environment (IDE) for R, but more for code
development than statistical analysis
• R is free, but this also means that there is no
formal support mechanism; large organizations
often like to contract with a commercial provider
R-Studio
Command Line? Advantages?
• In the social sciences there has been a lot of talk
lately about replication: the necessity of having
results that are reproducible
• In the world of “big data,” analysts want to
produce systems that are transparent, reliable,
and that maintain a chain of provenance for each
transformation that affects the data
• Looking at statistical analysis as a kind of
“programming” task (like the old days!) has
immense advantages
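A minimal sketch of what a scripted "chain of provenance" can look like, assuming only base R and its built-in `mtcars` dataset (the dataset and the step names are illustrative choices, not from the slides). Each transformation is a named, rerunnable step rather than a sequence of menu clicks.

```r
# Built-in dataset, so the script runs the same way on any R installation.
raw <- mtcars

# Step 1: keep only the columns needed for the analysis.
step1 <- raw[, c("mpg", "cyl", "wt")]

# Step 2: derive a new variable (wt is recorded in thousands of pounds).
step2 <- transform(step1, wt_lbs = wt * 1000)

# Step 3: summarize mean miles per gallon by number of cylinders.
result <- aggregate(mpg ~ cyl, data = step2, FUN = mean)

print(result)
```

Because every intermediate object is kept, anyone rerunning the script can inspect the data at each stage of the chain.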
Look Out! Real Code!
# readShapeSpatial() is provided by the maptools package
library(maptools)
# Read U.S. States shape data from census GIS data set
usShape <- readShapeSpatial("gz_2010_us_040_00_500k.shp")
# Attach the delta CPI data to the states
usShape@data$delta <- stateCPIdelta # Consumer price indices in this table
# This sets up break points for color designations.
# We want 20 gradations of color across all choropleths.
bfloor <- floor(min(usShape@data[,"delta"],na.rm=TRUE)*10)/10
bceil <- (ceiling(max(usShape@data[,"delta"],na.rm=TRUE)*10)/10) + 20
breaks <- seq(bfloor, bceil, 20)
# Attach the color cut points to the shape data
usShape@data$zCat <- cut(usShape@data[,"delta"],breaks,include.lowest=TRUE)
cutpoints <- levels(usShape@data$zCat) # For later use with the legend
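The binning step can be tried without the shapefile. The following is a sketch under stated assumptions: the `delta` values are made up for illustration, and the breaks are built with `length.out = 21` to yield exactly 20 intervals, which is a variant of the slide code (the slide uses a fixed step of 20 instead).

```r
# Hypothetical values standing in for the per-state delta column.
delta <- c(3.2, 7.9, 12.4, 18.1, 25.6, 31.0, 44.7)

# Round the range outward to one decimal place, as the slide code does.
bfloor <- floor(min(delta, na.rm = TRUE) * 10) / 10
bceil  <- ceiling(max(delta, na.rm = TRUE) * 10) / 10

# 21 equally spaced break points define 20 intervals.
breaks <- seq(bfloor, bceil, length.out = 21)

# Assign each value to its interval; include.lowest keeps the minimum in the first bin.
zCat <- cut(delta, breaks, include.lowest = TRUE)

cutpoints <- levels(zCat)  # interval labels, e.g. for a legend
length(cutpoints)          # number of color categories
```

Each level of `zCat` can then be mapped to one color when drawing a choropleth.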
Colorful!
Many Packages - CRAN Task View
ChemPhys: Chemometrics and Computational Physics
Econometrics: Computational Econometrics
Environmetrics: Analysis of Ecological and Environmental Data
ExperimentalDesign: Design of Experiments (DoE) & Analysis of Experimental Data
Finance: Empirical Finance
Genetics: Statistical Genetics
Graphics: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
HighPerformanceComputing: High-Performance and Parallel Computing with R
MachineLearning: Machine Learning & Statistical Learning
MedicalImaging: Medical Image Analysis
MetaAnalysis: Meta-Analysis
Multivariate: Multivariate Statistics
NaturalLanguageProcessing: Natural Language Processing
Optimization: Optimization and Mathematical Programming
Pharmacokinetics: Analysis of Pharmacokinetic Data
Phylogenetics: Phylogenetics, Especially Comparative Methods
Psychometrics: Psychometric Models and Methods
ReproducibleResearch: Reproducible Research
SocialSciences: Statistics for the Social Sciences
Spatial: Analysis of Spatial Data
Survival: Survival Analysis
TimeSeries: Time Series Analysis
WebTechnologies: Web Technologies and Services
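One way to work with a Task View from the command line is the add-on package `ctv`, which is not mentioned in the slides and is an assumption here. The sketch below wraps its `install.views()` function in a hypothetical helper named `install_task_view`; the call itself is left commented out because it downloads many packages.

```r
# Sketch: install every package listed in one CRAN Task View.
# Assumes the add-on package 'ctv' and a network connection are available.
install_task_view <- function(view = "Spatial") {
  if (!requireNamespace("ctv", quietly = TRUE)) {
    install.packages("ctv", repos = "https://cran.r-project.org")
  }
  ctv::install.views(view)
}

# Not run here because it downloads many packages:
# install_task_view("Spatial")
```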
Why R?
• Free and open source
• Huge community of users, enormous
repository of working code examples, many
sources of online expertise/support
• Dizzying array of add-on packages for almost
any imaginable data application
• Encourages good data practice: coding a
reproducible chain of data transformations
Jsresearch.net

Why R? A Brief Introduction to the Open Source Statistics Platform
