Hadley Wickham 

@hadleywickham

Chief Scientist, RStudio
Managing many models
November 2016
You’ve never seen data presented
like this. With the drama and
urgency of a sportscaster,
statistics guru Hans Rosling
debunks myths about the so-
called “developing world.”
https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen
[Plot: lifeExp against year, 1950 to 2000; 142 countries]
[Plot: estimated yearly increase in life expectancy against R2; points coloured by continent (Africa, Americas, Asia, Europe, Oceania)]
But...
Arbitrarily complicated models
Three simple underlying ideas
Scales to big data
Each idea is partnered with a package
1. Nested data (tidyr)
2. Functional programming (purrr)
3. Models → tidy data (broom)
[Plot: lifeExp against year, 1950 to 2000; 142 countries]
Want to summarise each with a linear model
Currently our data has one row per observation
Country      Year  LifeExp
Afghanistan  1952  28.9
Afghanistan  1957  30.3
Afghanistan  ...   ...
Albania      1952  55.2
Albania      1957  59.3
Albania      ...   ...
Algeria      ...   ...
...          ...   ...
More convenient to have one row per group
Country      Data
Afghanistan  <df>
Albania      <df>
Algeria      <df>
...          ...

Afghanistan <df>:
Year  LifeExp
1952  28.9
1957  30.3
...   ...

Albania <df>:
Year  LifeExp
1952  55.2
1957  59.3
...   ...
I call this a nested data frame
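In R:
library(dplyr)
library(tidyr)
library(gapminder)  # provides the gapminder data frame

by_country <- gapminder %>%
  # Assumption: year1950 (years since 1950) is used on later slides but never defined
  mutate(year1950 = year - 1950) %>%
  group_by(continent, country) %>%
  nest()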
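As a rough illustration of what nesting produces, the sketch below builds a tiny stand-in for the data from the values shown on the slides. The name `life` and its four rows are a hypothetical toy example, not the full gapminder data set.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for gapminder, using values shown on the slides
life <- tibble::tribble(
  ~country,      ~year, ~lifeExp,
  "Afghanistan",  1952,     28.9,
  "Afghanistan",  1957,     30.3,
  "Albania",      1952,     55.2,
  "Albania",      1957,     59.3
)

# One row per country; the remaining columns move into a list-column `data`
nested <- life %>%
  group_by(country) %>%
  nest()

nested            # one row per country
nested$data[[1]]  # the year/lifeExp rows for the first country
```

Each element of `data` is itself a data frame, which is what lets a model be fitted to it later.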
Each country will have an associated model
Country Data
Afghanistan <df>
Albania <df>
Algeria <df>
... ...
lm(lifeExp ~ year1950, data = afghanistan)
lm(lifeExp ~ year1950, data = albania)
Why not store that in a column too?
Country Data Model
Afghanistan <df> <lm>
Albania <df> <lm>
Algeria <df> <lm>
... ... ...
List-columns keep related things together
Anything can go in a list & a list can go in a data frame
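A minimal sketch of that idea, assuming the tibble package is available. The object `toy` and its contents are hypothetical and chosen only to show that one column can hold very different kinds of object, including a fitted model (here fitted to R's built-in `cars` data).

```r
library(tibble)

# Hypothetical toy objects, purely for illustration
toy <- tibble(
  name  = c("numbers", "letters", "model"),
  thing = list(1:3, letters[1:2], lm(dist ~ speed, data = cars))
)

# Each cell of `thing` holds a whole R object
toy
class(toy$thing[[3]])
```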
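In R:
library(dplyr)
library(purrr)

# Fit a linear model for one country's data; df must contain a year1950 column
country_model <- function(df) {
  lm(lifeExp ~ year1950, data = df)
}

# Fit one model per country and store it in a list-column
models <- by_country %>%
  mutate(
    model = map(data, country_model)
  )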
[Plot: lifeExp against year, 1950 to 2000; 142 countries]
[Plot: estimated yearly increase in life expectancy against R2; points coloured by continent (Africa, Americas, Asia, Europe, Oceania)]
What can we do with a list of models?
Country Data Model
Afghanistan <data> <lm>
Albania <data> <lm>
Algeria <data> <lm>
... <data> <lm>
What data can we extract from a model?
New Zealand

year  lifeExp
1952  69.4
1957  70.3
1962  71.2
1967  71.5
...   ...

lm(lifeExp ~ year, data = nz)

glance:
R2 = 0.95

tidy:
Intercept  -307.7
Slope      0.19

augment:
year  resid
1952  0.70
1957  0.61
1962  0.63
1967  -0.05
...   ...
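The three broom verbs can be tried on a single model first. The sketch below assumes the gapminder and broom packages are installed. The names `nz` and `nz_mod` are illustrative; the slides show only the `lm()` call.

```r
library(dplyr)
library(gapminder)

# New Zealand rows only
nz <- filter(gapminder, country == "New Zealand")
nz_mod <- lm(lifeExp ~ year, data = nz)

broom::glance(nz_mod)   # one row of model-level summaries, including r.squared
broom::tidy(nz_mod)     # one row per coefficient (intercept and slope)
broom::augment(nz_mod)  # one row per observation, with fitted values and residuals
```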
We need to do that for each model
models <- models %>%
  mutate(
    glance  = map(model, broom::glance),
    tidy    = map(model, broom::tidy),
    augment = map(model, broom::augment)
  )
Which gives us:
Country Data Model Glance Tidy Augment
Afghanistan <df> <lm> <df> <df> <df>
Albania <df> <lm> <df> <df> <df>
Algeria <df> <lm> <df> <df> <df>
... ... ... ... ... ...
Unnest lets us go back to a regular data frame
Country      Data
Afghanistan  <df>
Albania      <df>
Algeria      <df>
...          ...

unnest(): one row per group → one row per observation
nest(): one row per observation → one row per group

Country      Year  LifeExp
Afghanistan  1952  28.9
Afghanistan  1957  30.3
Afghanistan  ...   ...
Albania      1952  55.2
Albania      1957  59.3
Albania      ...   ...
Algeria      ...   ...
...          ...   ...
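Putting the pieces together, one possible end-to-end sketch nests the data, fits a model per country, and unnests the glance summaries into a regular data frame. It assumes the gapminder and broom packages are installed and, for simplicity, regresses on `year` directly rather than on `year1950`. The names `fits` and `fit_quality` are illustrative, not from the slides.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(gapminder)

# Nest, fit one model per country, and summarise each model with glance()
fits <- gapminder %>%
  group_by(continent, country) %>%
  nest() %>%
  mutate(
    model  = map(data, ~ lm(lifeExp ~ year, data = .x)),
    glance = map(model, broom::glance)
  )

# unnest() expands the one-row glance data frames into ordinary columns
fit_quality <- fits %>%
  ungroup() %>%
  select(continent, country, glance) %>%
  unnest(glance)

# Countries whose linear fit explains the least variance
fit_quality %>%
  arrange(r.squared) %>%
  select(continent, country, r.squared) %>%
  head()
```

Because the result is an ordinary data frame again, the usual dplyr and plotting tools apply to it directly.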
Demo
1. Store related objects in list-columns.
2. Learn FP so you can focus on verbs, not objects.
3. Use broom to convert models to tidy data.
Data frames: dplyr
Lists: purrr
Data frames ↔ lists: tidyr
Models → data frames: broom
Workflow replaces many uses of ldply()/dlply() (plyr) and do() + rowwise() (dplyr)
http://r4ds.had.co.nz/
This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/

PLOTCON NYC: New Open Viz in R