05 subsetting

Stat405 Subsetting & shortcuts

Hadley Wickham
Tuesday, 7 September 2010

Roadmap
• Lectures 1-3: basic graphics
• Lectures 4-6: basic data handling
• Lectures 7-9: basic functions

• These are the absolutely essential tools. The rest
of the course is about building your vocabulary and
learning how to use the tools together.


1. Character subsetting
2. Sorting
3. Shortcuts
4. Iteration
5. (Optional extra: command line tips)


Subsetting


Your turn

In pairs, try and recall the five types of
subsetting we talked about last week.
You have one minute!


blank        include all
integer      +ve: include
             -ve: exclude
logical      include TRUEs
character    lookup by name
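
As a quick recap, the five types can be sketched on a small named vector. The vector x and its names are made up for illustration; they are not part of the course data.

```r
# Toy named vector (hypothetical values, for illustration only)
x <- c(a = 10, b = 20, c = 30, d = 40)

all_vals  <- x[]              # blank: include all
first_two <- x[c(1, 2)]       # positive integers: include those positions
drop_two  <- x[-c(1, 2)]      # negative integers: exclude those positions
large     <- x[x > 25]        # logical: include where TRUE
by_name   <- x[c("d", "a")]   # character: look up by name

print(by_name)
```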


# Matches by names
diamonds[1:5, c("carat", "cut", "color")]

# Useful technique: change labelling
# (cut is a factor, so convert to character to look up by name)
c("Fair" = "C", "Good" = "B", "Very Good" = "B+",
  "Premium" = "A", "Ideal" = "A+")[as.character(diamonds$cut)]

# Can also be used to collapse levels
table(c("Fair" = "C", "Good" = "B", "Very Good" = "B",
  "Premium" = "A", "Ideal" = "A")[as.character(diamonds$cut)])

# (see ?cut for continuous to discrete equivalent)
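
One caveat when the index is a factor, as diamonds$cut is: R indexes with the factor's integer codes, not its labels. The lookup vectors above list their names in the same order as the levels of cut, so position and name agree there. The sketch below uses a made-up lookup table and factor (both hypothetical) where the orders differ, to show why converting to character is the safer habit.

```r
# Toy lookup table and factor (hypothetical data, not from diamonds)
grades <- c("low" = "C", "high" = "A", "mid" = "B")
quality <- factor(c("high", "low", "mid", "high"))  # levels: high, low, mid

# Indexing with the factor uses its integer codes (1 = high, 2 = low, 3 = mid),
# so this picks elements by position, not by name
by_code <- unname(grades[quality])

# Converting to character first gives a true lookup by name
by_name <- unname(grades[as.character(quality)])

print(by_code)
print(by_name)
```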


Sorting a data frame
x <- c(2, 4, 3, 1)
order(x)
# means: to get x in order, put 4th in
# 1st, 1st in 2nd, 3rd in 3rd and 2nd in 4th
x[order(x)]

# What does this do?
diamonds[order(diamonds$price), ]


# Order by x, then y, then z
order(diamonds$x, diamonds$y, diamonds$z)

# Put in order of quality
order(diamonds$color, desc(diamonds$cut),
desc(diamonds$clarity))

# desc() sorts in descending order
# (it is found in the plyr package)
x[order(x)]
x[order(desc(x))]
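
To see how order() breaks ties with a second key, here is a minimal base-R sketch on a made-up data frame (the names df, group and score are hypothetical). It negates a numeric column to sort it in descending order, which is one way to get the effect of desc() without loading plyr.

```r
# Toy data frame (hypothetical values)
df <- data.frame(
  group = c("b", "a", "b", "a"),
  score = c(3, 1, 4, 2)
)

# Ascending group, then descending score within each group.
# For numeric columns, negating the values reverses the order;
# -xtfrm(x) does the same for factors and character vectors.
idx <- order(df$group, -df$score)
sorted <- df[idx, ]

print(sorted)
```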


Your turn
• Reorder the mpg dataset from most to
least efficient.
• The fl variable gives the type of fuel (r =
regular, d = diesel, p = premium, c = cng,
e = ethanol). Modify fl to spell out the
fuel type explicitly, collapsing c, d, and e
into a single "other" category.


Short cuts


Short cuts
You've been typing diamonds many, many
times. The following shortcuts save
typing, but may be a little harder to
understand, and will not work in some
situations. (Don't forget the basics!)
Four are specific to data frames; one is
more generic.


Function     Package
subset       base
summarise    plyr
transform    base
arrange      plyr

In the ggplot2 version used in this course, plyr is
loaded automatically with ggplot2; otherwise load it
explicitly with library(plyr).
base is always loaded automatically.

# subset: short cut for subsetting
zero_dim <- diamonds$x == 0 | diamonds$y == 0 |
diamonds$z == 0
diamonds[zero_dim & !is.na(zero_dim), ]

subset(diamonds, x == 0 | y == 0 | z == 0)

# summarise/summarize: short cut for creating summary
biggest <- data.frame(
price.max = max(diamonds$price),
carat.max = max(diamonds$carat))

biggest <- summarise(diamonds,
price.max = max(price),
carat.max = max(carat))


# transform: short cut for adding new variables
diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
diamonds$density <- diamonds$volume / diamonds$carat

diamonds <- transform(diamonds, volume = x * y * z)
diamonds <- transform(diamonds,
density = volume / carat)

# arrange: short cut for reordering
diamonds <- diamonds[order(diamonds$price,
desc(diamonds$carat)), ]

diamonds <- arrange(diamonds, price, desc(carat))


# They all have similar syntax. The first argument
# is a data frame, and all other arguments are
# interpreted in the context of that data frame
# (so you don't need to use data$ all the time)

subset(df, subset)
transform(df, var1 = expr1, ...)
summarise(df, var1 = expr1, ...)
arrange(df, var1, ...)

# They all return a modified data frame. You still
# have to save that to a variable if you want to
# keep it


Your turn
Use summarise, transform, subset and arrange
to:
• Find all diamonds bigger than 3 carats and
order from most expensive to cheapest.
• Add a new variable that estimates the
diameter of the diamond (average of x and y).
• Compute depth (z / diameter * 100) yourself.
How does it compare to the depth in the data?


Aside:
never use attach!
Non-local effects; not symmetric; implicit,
not explicit.
Makes it very easy to make mistakes.
Use with() instead:
with(bnames, table(year, length))


# with is more general. Use in concert with other
# functions, particularly those that don't have a data
# argument

diamonds$volume <- with(diamonds, x * y * z)

# This won't work:
with(diamonds, volume <- x * y * z)
# with only changes lookup, not assignment
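
A small sketch of the same point on a made-up data frame (df and prod_xy are hypothetical names): the assignment inside with() leaves the data frame unchanged, while assigning the value that with() returns does what was intended.

```r
# Toy data frame (hypothetical values)
df <- data.frame(x = c(1, 2), y = c(3, 4))

# The assignment happens in a temporary environment built from df,
# so df itself gains no new column
with(df, prod_xy <- x * y)
has_col_after_with <- "prod_xy" %in% names(df)

# Assign the value returned by with() instead
df$prod_xy <- with(df, x * y)

print(has_col_after_with)
print(df$prod_xy)
```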


Iteration


Stories
Best data analyses tell a story, with a
natural flow from beginning to end.
For homeworks, try and come up with
three plots that tell a story.
Stories about a small sample of the data
can work well.


qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)

# Start by fixing incorrect values

y_big <- diamonds$y > 10
z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0
y_zero <- diamonds$y == 0
z_zero <- diamonds$z == 0

diamonds$x[x_zero] <- NA
diamonds$y[y_zero | y_big] <- NA
diamonds$z[z_zero | z_big] <- NA

qplot(x, y, data = diamonds)
# How can I get rid of those outliers?

qplot(x, x - y, data = diamonds)
qplot(x - y, data = diamonds, binwidth = 0.01)
last_plot() + xlim(-0.5, 0.5)
last_plot() + xlim(-0.2, 0.2)

asym <- abs(diamonds$x - diamonds$y) > 0.2
# asym is NA where x or y is missing; drop those rows too
diamonds_sym <- diamonds[!asym & !is.na(asym), ]

# Did it work?
qplot(x, y, data = diamonds_sym)
qplot(x, x - y, data = diamonds_sym)
# Something interesting is going on there!
qplot(x, x - y, data = diamonds_sym,
geom = "bin2d", binwidth = c(0.1, 0.01))
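
Because x and y now contain NA, a comparison such as abs(x - y) > 0.2 is NA for those rows, and an NA in a logical row index returns a row of NAs rather than dropping the row. The sketch below shows this on a made-up data frame (all values hypothetical), along with the !is.na() guard used earlier with zero_dim.

```r
# Toy data frame with a missing measurement (hypothetical values)
df <- data.frame(x = c(4.0, NA, 5.0, 6.0), y = c(4.1, 4.0, 5.9, 6.0))

asym <- abs(df$x - df$y) > 0.2    # FALSE, NA, TRUE, FALSE

# An NA in a logical row index produces a row filled with NA
with_na_row <- df[!asym, ]

# Requiring a non-missing comparison drops that row instead
clean <- df[!asym & !is.na(asym), ]

print(nrow(with_na_row))
print(nrow(clean))
```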


# What about x and z?
qplot(x, z, data = diamonds_sym)
qplot(x, x - z, data = diamonds_sym)
# Subtracting doesn't work - z smaller than x and y
qplot(x, x / z, data = diamonds_sym)
# But better to log transform to make symmetrical
qplot(x, log10(x / z), data = diamonds_sym)

# and so on...


# How does symmetry relate to price?
qplot(abs(x - y), price, data = diamonds_sym) +
  geom_smooth()

qplot(abs(x - y), price, data = diamonds_sym, geom =
"boxplot", group = round(abs(x-y) * 10))

diamonds_sym$sym <- zapsmall(abs(diamonds_sym$x -
diamonds_sym$y))
qplot(sym, price, data = diamonds_sym,
geom = "boxplot", group = sym)
# Are asymmetric diamonds worth more?

qplot(carat, price, data = diamonds_sym, colour = sym)
qplot(log10(carat), log10(price), data = diamonds_sym,
  colour = sym, group = sym) + geom_smooth(method = lm, se = FALSE)


# Modelling

summary(lm(log10(price) ~ log10(carat) + sym,
data = diamonds_sym))
# But statistical significance != practical
# significance

sd(diamonds_sym$sym, na.rm = TRUE)
# [1] 0.02368828

# So a 1 sd increase in sym changes log10(price)
# by about -0.01 (= 0.0237 * -0.44)
# 10 ^ (0.0237 * -0.44) = 0.976
# So a 1 sd increase in sym decreases price by ~2%
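
The arithmetic can be checked directly. This sketch takes the two numbers quoted on the slide as given (the coefficient -0.44 comes from the model summary, whose output is not reproduced here); the variable names are made up for illustration.

```r
# Values quoted on the slide; the coefficient is taken as given
sd_sym <- 0.02368828
coef_sym <- -0.44

delta_log10 <- sd_sym * coef_sym   # change in log10(price) per 1 sd of sym
ratio <- 10 ^ delta_log10          # multiplicative change in price
pct_change <- (ratio - 1) * 100    # percentage change in price

print(round(delta_log10, 4))
print(round(ratio, 3))
print(round(pct_change, 1))
```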


Command line

Why?

Provenance & reproducibility.
Working with remote servers.
Automation & scripting.
Common tools.


Basics
pwd: the location of the current directory
ls: the files in the current directory
cd: change to another directory
cd ..: change to parent directory
cd ~: change to home directory
mkdir: create a new directory
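
These commands have close counterparts inside R, which can be handy when scripting. The sketch below is only an illustration using base R functions; it works in a temporary directory, and the directory name demo_course is hypothetical.

```r
# Work inside a temporary directory so nothing permanent is created
old_wd <- setwd(tempdir())      # like cd; returns the previous directory

demo_dir <- "demo_course"       # hypothetical directory name
dir.create(demo_dir)            # like mkdir
setwd(demo_dir)                 # like cd demo_course
here <- basename(getwd())       # getwd() is like pwd
files_here <- list.files()      # like ls; empty for a new directory
setwd("..")                     # like cd ..

setwd(old_wd)                   # restore the original directory
print(here)
print(files_here)
```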


Your turn
• Create a directory for stat405.
• Inside that directory, create a directory for
homework 2.
• Confirm that there are no files in that
directory.
• Navigate back to your home directory.
What other files are there?

