Data Science in R
By
Dr. Fiaz Gul Khan
COMSATS Abbottabad
Contents
• Basics of R Programming for Data Science
– Why learn R ?
– How to install R / R Studio ?
– How to install R packages ?
– Basic computations in R
• Essentials of R Programming
– Data Types and Objects in R
– Control Structures (Functions) in R
– Useful R Packages
Why learn R
• The style of coding is quite easy.
• It’s open source, so there are no subscription or licence
charges.
• Instant access to over 7800 packages (the count when these
slides were written; it keeps growing) built for various computation tasks.
• The community support is extensive. There
are numerous forums to help you out.
• High performance computing is available
(requires additional packages).
Interface of RStudio
• R Console: This area shows the output of the code you run. You can
also type code directly into the console, but code entered there is
not saved in a file, so it is hard to trace later. This is where an R script comes in.
• R Script: As the name suggests, this is the space to write code. To
run it, select the line(s) of code and press Ctrl +
Enter. Alternatively, click the small ‘Run’ button at the top
right corner of the R Script pane.
• R Environment: This space lists the objects you have created or
loaded, including data sets, variables, vectors and functions. To
check whether data has been loaded properly in R, look at this area.
• Graphical Output: This space displays the graphs created during
exploratory data analysis. It also has tabs for browsing packages and
reading R’s official documentation.
How to install R Packages ?
• Much of the power of R lies in its packages.
In R, most data handling tasks can be performed in 2
ways: using R packages or using base R functions.
• install.packages("package name")
• As a first-time user, a pop-up might appear asking you to select
your CRAN mirror (country server); choose
one and press OK.
• Note: You can type this either in the console directly and
press ‘Enter’, or in an R script and click ‘Run’.
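A common pattern, sketched here, is to install a package only when it is missing and then load it with library(). The package name is just a stand-in: "stats" ships with R, so the install branch is skipped when you run this; substitute the package you actually need.

```r
# "stats" is a placeholder chosen because it ships with every R installation
pkg <- "stats"

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package so its functions can be called directly
library(pkg, character.only = TRUE)

# Confirm that the package is now attached to the session
is_attached <- paste0("package:", pkg) %in% search()
print(is_attached)
```

Installing happens once per machine; library() is needed in every new R session.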
Basic Computations in R
• Let’s begin with the basics. To get familiar with the R coding environment, start
with some basic calculations. The R console can be used as an interactive
calculator too. Type the following in your console:
• > 2 + 3
[1] 5
• > 6 / 3
[1] 2
• > (3*8)/(2*3)
[1] 4
• > log(12) #natural logarithm (base e)
[1] 2.484907
• > sqrt(121)
[1] 11
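Because log() uses base e by default, it is easy to mix it up with the base-10 logarithm. The short sketch below compares the variants; the variable names are only illustrative.

```r
natural <- log(12)           # natural logarithm (base e)
base10  <- log10(12)         # base-10 logarithm, roughly 1.08
base2   <- log(8, base = 2)  # logarithm with an explicit base

print(natural)
print(base10)
print(base2)

# exp() undoes the natural logarithm
print(exp(natural))
```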
Basic Computations in R
• Similarly, you can experiment with various combinations of calculations and see the
results. If you want to get back a previous calculation, this can be done in
two ways. First, click in the R console and press the ‘Up / Down Arrow’ keys on your
keyboard to step through the previously executed commands, then press Enter to run one again.
• But what if you have done too many calculations ? It would be too painful to
scroll through every command to find the one you want. The second way is to store
the result in a variable.
• In R, you can create a variable using <- or = sign. Let’s say I want to create a
variable x to hold the sum of 7 and 8. I’ll write it as:
• > x <- 8 + 7
> x
[1] 15
• Once you assign to a variable, the result is no longer printed directly (as on a calculator)
unless you call the variable on the next line. Remember, variable names can be
alphabetic or alphanumeric, but they cannot be purely numeric or start with a digit.
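The assignment and naming rules above can be tried out with a small sketch. make.names() is a base R helper that shows how R would repair an invalid name; the example names passed to it are made up for illustration.

```r
x <- 8 + 7     # assignment with <- prints nothing
y = x * 2      # = also assigns at the top level
(z <- x + y)   # wrapping the assignment in parentheses also prints the value

# make.names() turns invalid names into valid ones
fixed <- make.names(c("x1", "1x", "total score"))
print(fixed)
```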
Essentials of R Programming
Objects
• R has five basic or ‘atomic’ classes of objects. Wait, what is an
object ? (Think of it like a data type in other programming languages.)
• Everything you see or create in R is an object. A vector, a matrix,
a data frame, even a variable is an object. R treats it that way.
The 5 basic ‘atomic’ classes of objects are:
• Character
• Numeric (Real Numbers)
• Integer (Whole Numbers)
• Complex
• Logical (TRUE / FALSE)
Attributes/properties
• Since these classes are self-explanatory by name,
I won’t elaborate on them. Objects of these classes can have
attributes. Think of attributes as metadata:
a name or number that describes the object. An
object can have the following attributes:
• names, dimension names
• dimensions
• class
• length
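As a minimal sketch, the attributes listed above can be inspected with base R functions. The vector v and matrix m are made-up examples.

```r
# A named numeric vector
v <- c(first = 1.5, second = 2.5, third = 3.5)
print(names(v))    # names
print(length(v))   # length
print(class(v))    # class

# A matrix with dimensions and dimension names
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
print(dim(m))       # dimensions
print(dimnames(m))  # dimension names

# attributes() lists everything attached to the object
print(attributes(m))
```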
Vectors
• For example: Let’s create vectors of different
classes. We can create a vector using the c()
(combine / concatenate) function or vector().
• > a <- c(1.8, 4.5) #numeric
> b <- c(1 + 2i, 3 - 6i) #complex
> d <- c(23L, 44L) #integer (the L suffix is required; c(23, 44) would be numeric)
> e <- vector("logical", length = 5) #logical, five FALSE values
• Similarly, you can create vectors of other
classes.
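To confirm which class each vector ends up with, class() can be applied to every one of them. This sketch also shows that whole numbers written without the L suffix are stored as numeric; d_num and d_int are illustrative names.

```r
a <- c(1.8, 4.5)
b <- c(1 + 2i, 3 - 6i)
d_num <- c(23, 44)     # no suffix: stored as numeric (double)
d_int <- c(23L, 44L)   # L suffix: stored as integer
e <- vector("logical", length = 5)

# Collect the class of each vector in one named character vector
classes <- c(a = class(a), b = class(b),
             d_num = class(d_num), d_int = class(d_int), e = class(e))
print(classes)

# vector("logical", ...) is filled with FALSE
print(e)
```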
Data Types/structures in R
• R has various ‘data types’/structures, which include vectors (numeric,
integer etc.), matrices, data frames and lists. Let’s understand them one by
one.
• Vector: As mentioned above, a vector contains objects of the same class. You
can still pass objects of different classes to c(). When objects of different classes are
mixed in a vector, coercion occurs: R converts all
the elements to one common class. For example:
• > qt <- c("Time", 24, "October", TRUE, 3.33) #character
> ab <- c(TRUE, 24) #numeric
> cd <- c(2.5, "May") #character
• To check the class of any object, use the class(vector_name) function.
• > class(qt)
[1] "character"
Converting class type
• To convert the class of a vector, use the as.*() family of functions. They return a
converted copy, so assign the result back to keep it.
• > bar <- 0:5
> class(bar)
[1] "integer"
> bar <- as.numeric(bar)
> class(bar)
[1] "numeric"
> bar <- as.character(bar)
> class(bar)
[1] "character"
• Similarly, you can change the class of any vector. But you should pay
attention here. If you convert a “character” vector to “numeric”,
any element that is not a valid number becomes NA (with a warning). Hence, be careful when using this
command.
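The sketch below puts implicit coercion and explicit conversion side by side. When classes are mixed, R promotes elements in the order logical, integer, numeric, complex, character; when converting text to numbers, anything that is not a valid number turns into NA. The input strings are made up for illustration.

```r
# Implicit coercion: the result takes the "widest" class present
mixed <- list(
  lgl_int = c(TRUE, 2L),
  int_num = c(2L, 3.5),
  num_cpl = c(3.5, 1 + 2i),
  any_chr = c(1 + 2i, "May")
)
coerced <- sapply(mixed, class)
print(coerced)

# Explicit conversion: invalid entries become NA
raw <- c("12", "3.5", "abc", "")
nums <- suppressWarnings(as.numeric(raw))  # warning silenced for the demo
print(nums)

# Find which inputs failed to convert
bad <- raw[is.na(nums)]
print(bad)
```

Checking is.na() on the result, as in the last step, is a simple way to spot values that did not convert cleanly.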
LIST
• A list is a special type of vector which can contain elements of different
data types. For example:
• > my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 22
[[2]]
[1] "ab"
[[3]]
[1] TRUE
[[4]]
[1] 1+2i
LIST cont…
• As you can see, the output of a list is different
from that of a vector, because each element is stored
as a separate object and keeps its own type. The double bracket [[1]]
shows the index of the first element and so on.
Hence, you can easily extract an element of
a list by its index. Like this:
• > my_list[[3]]
[1] TRUE
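One detail worth trying out is the difference between double and single brackets: [[ ]] returns the element itself, while [ ] returns a shorter list. The named list at the end is an illustrative extra, not part of the slides.

```r
my_list <- list(22, "ab", TRUE, 1 + 2i)

element  <- my_list[[3]]  # the element itself: a logical value
sub_list <- my_list[3]    # still a list, of length 1

print(class(element))
print(class(sub_list))

# Elements can also be named and then extracted by name
named <- list(count = 22, label = "ab")
print(named$label)
print(named[["count"]])
```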
MATRIX
• Matrices: When a vector is given
rows and columns, i.e. a dimension attribute, it becomes a
matrix. A matrix is represented by a set of rows and columns. It
is a 2-dimensional data structure. It consists of elements of the
same class. Let’s create a matrix of 3 rows and 2 columns:
• > my_matrix <- matrix(1:6, nrow=3, ncol=2)
> my_matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
MATRIX cont.
• > dim(my_matrix)
[1] 3 2
• > attributes(my_matrix)
$dim
[1] 3 2
• As you can see, the dimensions of a matrix can be obtained using
either the dim() or the attributes() command. To extract a particular row, column or element
from a matrix, use the indices shown above. For example (try
this at your end):
• > my_matrix[,2] #extracts second column
> my_matrix[,1] #extracts first column
> my_matrix[2,] #extracts second row
> my_matrix[1,] #extracts first row
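Beyond whole rows and columns, a single cell can be picked with both indices. The sketch also shows two optional arguments: byrow = TRUE fills the matrix row by row, and drop = FALSE keeps a one-column result as a matrix.

```r
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)

# Row 2, column 2
single <- my_matrix[2, 2]
print(single)

# By default matrix() fills column by column; byrow = TRUE fills row by row
by_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
print(by_row)

# Extracting one column normally returns a plain vector;
# drop = FALSE keeps it as a 3 x 1 matrix
col_as_matrix <- my_matrix[, 2, drop = FALSE]
print(dim(col_as_matrix))
```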
Matrix cont..
• As an interesting fact, you can also create a matrix from a vector. All
you need to do is assign its dimensions with dim() afterwards. Like this:
• > age <- c(23, 44, 15, 12, 31, 16)
> age
[1] 23 44 15 12 31 16
> dim(age) <- c(2,3)
> age
     [,1] [,2] [,3]
[1,]   23   15   31
[2,]   44   12   16
• > class(age) #R versions before 4.0.0 print only "matrix"
[1] "matrix" "array"
Matrix cont..
• You can also join two vectors using the cbind() and rbind() functions. Make sure
that both vectors have the same number of elements. If not, R recycles the shorter vector
to fill the gap (with a warning when the lengths are not multiples of each other); it does not insert NA values.
• > x <- c(1, 2, 3, 4, 5, 6)
> y <- c(20, 30, 40, 50, 60, 70)
> cbind(x, y)
     x  y
[1,] 1 20
[2,] 2 30
[3,] 3 40
[4,] 4 50
[5,] 5 60
[6,] 6 70
• > class(cbind(x, y)) #R versions before 4.0.0 print only "matrix"
[1] "matrix" "array"
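To see what happens when the lengths differ, the sketch below binds a six-element vector to a five-element one. The shorter vector is recycled from its start, so the last row reuses its first value. y_short is an illustrative name, and the warning is silenced only to keep the output tidy.

```r
x <- c(1, 2, 3, 4, 5, 6)
y_short <- c(20, 30, 40, 50, 60)

# cbind() recycles y_short: row 6 gets its first value again
recycled <- suppressWarnings(cbind(x, y_short))
print(recycled)

# With equal lengths, rbind() stacks the vectors as rows instead
y <- c(20, 30, 40, 50, 60, 70)
stacked <- rbind(x, y)
print(dim(stacked))
```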
Data Frame
• Data Frame: This is the most commonly
used member of the data types family. It is used to store
tabular data. It is different from a matrix. In a matrix,
every element must have the same class. But in a data
frame, you can combine vectors of different
classes. This means every column of a data frame
acts like a vector and every row like a list. Functions that
read tabular data into R, such as read.csv(), return
a data frame. Hence, it is important to understand
the most commonly used commands on data frames.
Data Frame cont..
• > df <- data.frame(name = c("ash","jane","paul","mark"),
score = c(67,56,87,91),
stringsAsFactors = TRUE) #store name as a factor; this is the default only before R 4.0.0
> df
  name score
1  ash    67
2 jane    56
3 paul    87
4 mark    91
> dim(df)
[1] 4 2
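Since each column behaves like a vector, a few everyday operations follow directly. This is a sketch on the same small data frame; the pass mark of 60 is an assumption made up for the example.

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(67, 56, 87, 91))

# A single column is a plain vector
scores <- df$score
print(scores)

# Keep only the rows that meet a condition
high <- df[df$score > 60, ]
print(high)

# Add a new column computed from an existing one (60 is a hypothetical pass mark)
df$passed <- df$score >= 60
print(df)
```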
Data Frame cont..
• > str(df)
'data.frame': 4 obs. of 2 variables:
$ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
$ score: num 67 56 87 91
> nrow(df)
[1] 4
> ncol(df)
[1] 2
Data Frame cont..
• str() returns the structure of a data frame, i.e. the list of variables
stored in the data frame along with their types.
• nrow() and ncol() return the number of rows and number of
columns in a data set respectively.
• Here you see “name” is a factor variable and “score” is
numeric. (From R 4.0.0 onwards, data.frame() keeps strings as character
unless you pass stringsAsFactors = TRUE.) In data science, a variable can be categorized into two
types: Continuous and Categorical.
• Continuous variables can take any numeric value, such as
1, 2, 3.5, 4.66 etc. Categorical variables take
only a limited set of distinct values (categories), such as "low", "medium", "high". In R, categorical
values are represented by factors. In df, name is a factor variable
having 4 unique levels.
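A factor stores each category as an integer code plus a set of levels. The sketch below builds one by hand; the size categories are invented for illustration.

```r
# Levels are given explicitly so they keep a meaningful order
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"))

print(levels(size))      # the distinct categories
print(as.integer(size))  # the integer code behind each value

# table() counts how often each category occurs
counts <- table(size)
print(counts)
```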
Missing values
• Let’s now understand the concept of missing values in R. Handling them is
one of the most painful yet crucial parts of predictive modeling.
You should know the common techniques for dealing with them.
• Missing values in R are represented by NA. NaN (‘not a number’, e.g. the result of 0/0)
is also treated as missing by is.na(). Now we’ll
check whether a data set has missing values (using the same data
frame df).
• > df[1:2,2] <- NA #injecting NA in rows 1 and 2 of column 2 of df
> df
  name score
1  ash    NA
2 jane    NA
3 paul    87
4 mark    91
Missing values cont…
• > is.na(df) #checks the entire data set for NAs and returns a logical matrix
      name score
[1,] FALSE  TRUE
[2,] FALSE  TRUE
[3,] FALSE FALSE
[4,] FALSE FALSE
> table(is.na(df)) #counts the FALSE and TRUE values
FALSE  TRUE
    6     2
Missing values cont…
• > df[!complete.cases(df),] #returns the rows having missing values
  name score
1  ash    NA
2 jane    NA
• Missing values hinder normal calculations on a data set. For example,
let’s say we want to compute the mean of score. Since there are two
missing values, it can’t be done directly. Let’s see:
• > mean(df$score)
[1] NA
> mean(df$score, na.rm = TRUE)
[1] 89
• The na.rm = TRUE argument tells R to ignore the NAs and
compute the mean of the remaining values in the selected column (score).
Missing values cont…
• To remove rows with NA values from a data
frame, you can use na.omit():
• > new_df <- na.omit(df)
> new_df
  name score
3 paul    87
4 mark    91
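Dropping rows is not the only option. As a minimal sketch, the code below counts the NAs per column and then fills the gaps with the column mean. Mean imputation is shown here only as one simple technique; whether it is appropriate depends on the data.

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(NA, NA, 87, 91))

# Number of missing values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Work on a copy so the original data frame stays untouched
imputed <- df
imputed$score[is.na(imputed$score)] <- mean(df$score, na.rm = TRUE)
print(imputed)
```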
Control Structures in R
• if, else – This structure is used to test a condition. Below is the syntax:
• if (<condition>) {
    ##do something when the condition is TRUE
} else {
    ##do something when the condition is FALSE
}
• Example
• #initialize a variable
N <- 10
• #check if this variable * 5 is > 40
if (N * 5 > 40) {
    print("This is easy!")
} else {
    print("It's not easy!")
}
[1] "This is easy!"
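if / else tests a single condition. When the same test has to be applied to every element of a vector, base R offers the vectorised ifelse() function, sketched here with made-up values for N.

```r
N <- c(5, 10, 12)

# ifelse(test, value_if_true, value_if_false) works element by element
labels <- ifelse(N * 5 > 40, "This is easy!", "It's not easy!")
print(labels)
```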
Control Structures in R cont….
• for – This structure is used when a loop is to be executed a fixed number of times. It is
commonly used for iterating over the elements of an object (list, vector). Below is the syntax:
• for (<variable> in <sequence>) {
    #do something
}
• Example
• #initialize a vector
y <- c(99,45,34,65,76,23)
• #print the first 4 numbers of this vector
for (i in 1:4) {
    print(y[i])
}
[1] 99
[1] 45
[1] 34
[1] 65
• Exercise: find the maximum value in this vector.
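One possible way to approach the exercise, sketched with a for loop: keep track of the largest value seen so far and update it whenever a bigger one appears. The built-in max() gives the same answer in one call.

```r
y <- c(99, 45, 34, 65, 76, 23)

# Start with the first element as the current maximum
largest <- y[1]

# Compare every element with the current maximum
for (value in y) {
  if (value > largest) {
    largest <- value
  }
}
print(largest)

# Cross-check against the built-in function
print(max(y))
```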
Control Structures in R cont….
• while – This structure repeats a block of code as long as a condition stays TRUE.
• #initialize a variable
Age <- 12
• #keep looping while Age is less than 17
while (Age < 17) {
    print(Age)
    Age <- Age + 1 #increment Age so the condition eventually becomes FALSE and the loop ends
}
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
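A loop can also be ended from inside its body with break. The sketch below keeps adding 1, 2, 3, ... until the running total passes 50; the threshold is an arbitrary value chosen for the example.

```r
total <- 0
n <- 0

# Loop without a fixed end; break leaves it once the total exceeds 50
while (TRUE) {
  n <- n + 1
  total <- total + n
  if (total > 50) {
    break
  }
}

print(n)
print(total)
```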
