Stat405   Subsetting & shortcuts


                            Hadley Wickham
Tuesday, 7 September 2010
Roadmap
                   • Lectures 1-3: basic graphics
                   • Lectures 4-6: basic data handling
                   • Lectures 7-9: basic functions


                   • These are the absolutely essential tools. The
                     rest of the course is about building your vocab
                     and learning how to use the tools together.


1. Character subsetting
2. Sorting
3. Shortcuts
4. Iteration
5. (Optional extra: command line tips)



Subsetting

Your turn

                   In pairs, try and recall the five types of
                   subsetting we talked about last week.
                   You have one minute!




blank       include all
integer     +ve: include
            -ve: exclude
logical     include TRUEs
character   lookup by name
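
The five types listed above can be tried out on a small named vector. This is only a sketch: the vector `x` and its names are made up for illustration and are not part of the course data.

```r
# Toy named vector; the values and names are illustrative only
x <- c(a = 10, b = 20, c = 30, d = 40)

all_vals <- x[]             # blank: include everything
pos_int  <- x[c(1, 3)]      # positive integers: include positions 1 and 3
neg_int  <- x[-c(1, 3)]     # negative integers: exclude positions 1 and 3
by_lgl   <- x[x > 25]       # logical: keep elements where the test is TRUE
by_name  <- x[c("d", "a")]  # character: look up by name, in the order asked

print(by_name)
```

The same five index types work on the rows and columns of a data frame.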


# Matches by names
     diamonds[1:5, c("carat", "cut", "color")]

     # Useful technique: change labelling
     # (cut is a factor, so convert it to character to look up by name)
     c("Fair" = "C", "Good" = "B", "Very Good" = "B+",
       "Premium" = "A", "Ideal" = "A+")[as.character(diamonds$cut)]

     # Can also be used to collapse levels
     table(c("Fair" = "C", "Good" = "B", "Very Good" = "B",
       "Premium" = "A", "Ideal" = "A")[as.character(diamonds$cut)])

     # (see ?cut for continuous to discrete equivalent)
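
One detail worth checking when relabelling this way: indexing with a factor uses its integer codes, not its labels, so the lookup is only by name when the index is a character vector. The sketch below uses a hypothetical three-level factor `cut_f` and a lookup table `grades` that is deliberately not in level order, to make the difference visible.

```r
# Hypothetical data; the lookup table is deliberately NOT in level order
cut_f  <- factor(c("Good", "Ideal", "Fair"),
                 levels = c("Fair", "Good", "Ideal"))
grades <- c("Ideal" = "A", "Good" = "B", "Fair" = "C")

# Factor index: uses the integer codes 2, 3, 1, i.e. positions in grades
by_code <- unname(grades[cut_f])

# Character index: matches on the names of grades
by_name <- unname(grades[as.character(cut_f)])

print(by_code)
print(by_name)
```

When the lookup table happens to be written in the same order as the factor levels, both forms give the same answer, which is why the pitfall is easy to miss.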



Sorting a data frame
               x <- c(2, 4, 3, 1)
               order(x)
               # means: to get x in order, put 4th in
               # 1st, 1st in 2nd, 3rd in 3rd and 2nd in 4th
               x[order(x)]

               # What does this do?
               diamonds[order(diamonds$price), ]
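
To see what indexing rows with order() does, here is a minimal sketch on a toy data frame (the `id` and `price` columns are invented for illustration). order() returns row positions, and using them as a row index moves whole rows together.

```r
# Toy data frame; column names are made up for illustration
df <- data.frame(id = c("a", "b", "c", "d"), price = c(2, 4, 3, 1))

idx    <- order(df$price)  # row positions, cheapest first
sorted <- df[idx, ]        # reorder whole rows, keeping columns together

print(idx)
print(sorted)
```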


# Order by x, then y, then z
     order(diamonds$x, diamonds$y, diamonds$z)

     # Put in order of quality
     order(diamonds$color, desc(diamonds$cut),
       desc(diamonds$clarity))

     # desc sorts in descending order
     # also found in the plyr package
     x[order(x)]
     x[order(desc(x))]
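
If plyr is not available, a similar effect can be sketched in base R. The example below assumes a toy data frame with invented columns `grp` and `val`; negating a numeric key reverses its order, and `-xtfrm()` does the same for keys that are not numeric.

```r
# Toy data frame; column names are made up for illustration
df <- data.frame(
  grp = c("b", "a", "b", "a"),
  val = c(1, 5, 3, 2)
)

# grp ascending, then val descending within each group
idx <- order(df$grp, -df$val)

# grp descending (via -xtfrm), then val ascending within each group
idx_rev <- order(-xtfrm(df$grp), df$val)

print(df[idx, ])
print(df[idx_rev, ])
```

Later keys only break ties left by earlier keys, which is the same rule the multi-column order() calls above rely on.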



Your turn
                   Reorder the mpg dataset from most to
                   least efficient.
                   The fl variable gives the type of fuel (r =
                   regular, d = diesel, p = premium, c = cng,
                   e = ethanol). Modify fl to spell out the
                   fuel type explicitly, collapsing c, d, and e
                   into a single "other" category.


Shortcuts
                   You’ve been typing diamonds many, many
                   times. The following shortcuts save
                   typing, but they may be a little harder to
                   understand and will not work in some
                   situations. (Don’t forget the basics!)
                   Four are specific to data frames; one is more
                   general.


Function     Package
subset       base
summarise    plyr
transform    base
arrange      plyr

plyr is loaded automatically with ggplot2, or you can load it explicitly
with library(plyr). base is always loaded automatically.

# subset: short cut for subsetting
     zero_dim <- diamonds$x == 0 | diamonds$y == 0 |
       diamonds$z == 0
     diamonds[zero_dim & !is.na(zero_dim), ]

     subset(diamonds, x == 0 | y == 0 | z == 0)

     # summarise/summarize: short cut for creating summary
     biggest <- data.frame(
       price.max = max(diamonds$price),
       carat.max = max(diamonds$carat))

     biggest <- summarise(diamonds,
       price.max = max(price),
       carat.max = max(carat))

# transform: short cut for adding new variables
     diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
     diamonds$density <- diamonds$volume / diamonds$carat

     diamonds <- transform(diamonds, volume = x * y * z)
     diamonds <- transform(diamonds,
       density = volume / carat)

     # arrange: short cut for reordering
     diamonds <- diamonds[order(diamonds$price,
       desc(diamonds$carat)), ]

     diamonds <- arrange(diamonds, price, desc(carat))

#     They all have similar syntax. The first argument
     #     is a data frame, and all other arguments are
     #     interpreted in the context of that data frame
     #     (so you don't need to use data$ all the time)

     subset(df, subset)
     transform(df, var1 = expr1, ...)
     summarise(df, var1 = expr1, ...)
     arrange(df, var1, ...)

     # They all return a modified data frame. You still
     # have to save that to a variable if you want to
     # keep it
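
The two base shortcuts can be checked against their long forms on a toy data frame. This is a sketch under assumptions: the columns `x`, `y`, `z` and the new column `total` are invented for illustration, and only the base functions subset() and transform() are used so that nothing beyond base R is needed.

```r
# Toy data frame with one missing value
df <- data.frame(x = c(0, 1.5, 2), y = c(1, 0, 2), z = c(1, 1, NA))

# Long form: build the logical vector, then guard against NA
zero_dim <- df$x == 0 | df$y == 0 | df$z == 0
long <- df[zero_dim & !is.na(zero_dim), ]

# Shortcut: columns are looked up inside df, and NA rows are dropped
short <- subset(df, x == 0 | y == 0 | z == 0)

# transform() returns a new data frame; df is unchanged until reassigned
with_total <- transform(df, total = x + y)

print(short)
print(names(df))
```

Because df was never reassigned, it still has only its three original columns after the transform() call.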


Your turn
                   Use summarise, transform, subset and arrange
                   to:
                   Find all diamonds bigger than 3 carats and
                   order from most expensive to cheapest.
                   Add a new variable that estimates the
                   diameter of the diamond (average of x and y).
                   Compute depth (z / diameter * 100) yourself.
                   How does it compare to the depth in the data?


Aside:
                            never use attach!
                   Non-local effects; not symmetric; implicit,
                   not explicit.
                   Makes it very easy to make mistakes.
                   Use with() instead:
                   with(bnames, table(year, length))



# with is more general. Use in concert with other
     # functions, particularly those that don't have a data
     # argument

     diamonds$volume <- with(diamonds, x * y * z)

     # This won't work:
     with(diamonds, volume <- x * y * z)
     # with only changes lookup, not assignment
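
The point about assignment can be verified on a small data frame. The sketch below uses invented columns `x`, `y`, `z`; the flag `has_volume_after_with` is a hypothetical helper name used only to record the result.

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# Assignment inside with() lands in a temporary environment and is discarded
with(df, volume <- x * y * z)
has_volume_after_with <- "volume" %in% names(df)

# Assign the result of with() from the outside instead
df$volume <- with(df, x * y * z)

print(has_volume_after_with)
print(df)
```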




Iteration

Stories
                   Best data analyses tell a story, with a
                   natural flow from beginning to end.
                   For homeworks, try and come up with
                   three plots that tell a story.
                   Stories about a small sample of the data
                   can work well.



qplot(x, y, data = diamonds)
     qplot(x, z, data = diamonds)

     # Start by fixing incorrect values

     y_big <- diamonds$y > 10
     z_big <- diamonds$z > 6

     x_zero <- diamonds$x == 0
     y_zero <- diamonds$y == 0
     z_zero <- diamonds$z == 0

     diamonds$x[x_zero] <- NA
     diamonds$y[y_zero | y_big] <- NA
     diamonds$z[z_zero | z_big] <- NA
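
The same pattern, flagging impossible values with a logical vector and then overwriting only those entries, can be sketched on a toy data frame. The values and the threshold of 10 here are illustrative, not taken from the diamonds data.

```r
df <- data.frame(x = c(0, 4.1, 5.2), y = c(4.0, 58.9, 5.3))

# Flag values that cannot be real measurements (thresholds are illustrative)
x_zero <- df$x == 0
y_big  <- df$y > 10

# Logical subsetting on the left of <- overwrites only the flagged entries
df$x[x_zero] <- NA
df$y[y_big]  <- NA

# Count how many values were marked as missing in each column
n_missing <- colSums(is.na(df))
print(n_missing)
```

Computing the flags before any replacement keeps each test independent of the others.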
qplot(x, y, data = diamonds)
     # How can I get rid of those outliers?

     qplot(x, x - y, data = diamonds)
     qplot(x - y, data = diamonds, binwidth = 0.01)
     last_plot() + xlim(-0.5, 0.5)
     last_plot() + xlim(-0.2, 0.2)

     asym <- abs(diamonds$x - diamonds$y) > 0.2
     # which() drops the NAs created above, so no all-NA rows sneak in
     diamonds_sym <- diamonds[which(!asym), ]

     # Did it work?
     qplot(x, y, data = diamonds_sym)
     qplot(x, x - y, data = diamonds_sym)
     # Something interesting is going on there!
     qplot(x, x - y, data = diamonds_sym,
       geom = "bin2d", binwidth = c(0.1, 0.01))

# What about x and z?
     qplot(x, z, data = diamonds_sym)
     qplot(x, x - z, data = diamonds_sym)
     # Subtracting doesn't work - z smaller than x and y
     qplot(x, x / z, data = diamonds_sym)
     # But better to log transform to make symmetrical
     qplot(x, log10(x / z), data = diamonds_sym)

     # and so on...




# How does symmetry relate to price?
     qplot(abs(x - y), price, data = diamonds_sym) +
       geom_smooth()

     qplot(abs(x - y), price, data = diamonds_sym,
       geom = "boxplot", group = round(abs(x - y) * 10))

     diamonds_sym$sym <- zapsmall(abs(diamonds_sym$x -
       diamonds_sym$y))
     qplot(sym, price, data = diamonds_sym,
       geom = "boxplot", group = sym)
     # Are asymmetric diamonds worth more?

     qplot(carat, price, data = diamonds_sym, colour = sym)
     qplot(log10(carat), log10(price), data = diamonds_sym,
       colour = sym, group = sym) +
       geom_smooth(method = "lm", se = FALSE)


# Modelling

     summary(lm(log10(price) ~ log10(carat) + sym,
       data = diamonds_sym))
     # But statistical significance != practical
     # significance

     sd(diamonds_sym$sym, na.rm = TRUE)
     # [1] 0.02368828

     # So a 1 sd increase in sym changes log10(price)
     # by about -0.01 (= 0.0237 * -0.44 = -0.0104)
     # 10 ^ -0.0104 = 0.976
     # So a 1 sd increase in sym decreases price by ~2%
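
The back-of-the-envelope calculation in the comments can be reproduced directly. The sketch below takes the standard deviation and the coefficient of -0.44 as quoted on the slide; the variable names are invented for illustration.

```r
sd_sym   <- 0.02368828  # standard deviation of sym quoted on the slide
coef_sym <- -0.44       # coefficient of sym quoted on the slide

delta_log10 <- sd_sym * coef_sym   # change in log10(price) per 1 sd of sym
ratio       <- 10^delta_log10      # multiplicative change in price
pct_change  <- (ratio - 1) * 100   # the same change as a percentage

print(round(delta_log10, 4))
print(round(ratio, 3))
print(round(pct_change, 1))
```

Because the model is fitted on log10(price), an additive change in the response becomes a multiplicative change in price, which is why the result is read as a percentage.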


Command line
Why?

                   Provenance & reproducibility.
                   Working with remote servers.
                   Automation & scripting.
                   Common tools.




Basics
                   pwd: the location of the current directory
                   ls: the files in the current directory
                   cd: change to another directory
                   cd ..: change to parent directory
                   cd ~: change to home directory
                   mkdir: create a new directory
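
For comparison, R has rough equivalents of these commands that can be run from an R session. The sketch below is an illustration only: it works in a throwaway directory created with tempfile(), and the directory names `stat405-demo-` and `hw2` are hypothetical.

```r
old  <- getwd()                     # like pwd: where are we now?
base <- tempfile("stat405-demo-")   # hypothetical scratch location

dir.create(base)                    # like mkdir
setwd(base)                         # like cd
dir.create("hw2")                   # mkdir inside the new directory
files_in_hw2 <- list.files("hw2")   # like ls hw2: should be empty
setwd("..")                         # like cd ..
setwd(old)                          # go back to where we started

print(files_in_hw2)
```

Restoring the original working directory at the end keeps the sketch from affecting the rest of the session.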


Your turn
                   Create a directory for stat405.
                   Inside that directory, create a directory for
                   homework 2.
                   Confirm that there are no files in that
                   directory.
                   Navigate back to your home directory.
                   What other files are there?

