An introduction to R
Version 1.4
November 2014
http://datasciencebusiness.wordpress.com
Prof. Richard Vidgen
Management Systems
Hull University Business School
E: r.vidgen@hull.ac.uk
Aims and use of this presentation
This presentation provides an introduction to R for anyone who wants to get an idea of what R is and what it can do. Although no prior experience of R is needed, a basic understanding of data analysis (e.g., multiple regression) is assumed, as is basic technical competence (e.g., installing software and managing directory structures). If you are already using SPSS you will get a feel for how it compares with R. It's a work in progress and will be updated based on experience and feedback. Docendo discimus (Latin: "by teaching, we learn").
Contents
• Predictive analytics and analytics tools
• An overview of R
• Installing R and an R IDE (integrated development environment)
• R syntax and data types
• Multiple regression in R and SPSS
• The twitteR package
• The Rfacebook package
Resources
• The R source and data files for this tutorial can be accessed at:
– http://datasciencebusiness.wordpress.com
• R_intro.R
• R_regression.R
• R_twitter.R
• R_twitter_sentiment.R
• sentiment.R
• R_facebook.R
• insurance.csv
• twitterHull.csv
• positive-words.txt
• negative-words.txt
WARNING: when R packages are updated by developers, things can break and your R programs may stop working. The code in the files above is tested regularly with the latest packages and updated as necessary.
Predictive analytics
Better decisions - predictive analytics
• A predictive model that calculates strawberry purchases based on:
– Weather forecast
– Store temperature
– Freezer sensor data
– Remaining stock per shelf life
– Sales transaction point-of-sale feeds
– Web searches, social mentions
http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do
Predictive analytics
• For example, what data might help us predict which students will drop out?
– Assessment grades at university
– Prior education attainment
– Social background
– Distance of home from university
– Friendship circles and networks (e.g., sports club memberships)
– Attendance at lectures and tutorials
– Interaction in lectures and tutorials
– Time spent on campus
– Time spent in the library
– Number of accesses to electronic learning resources
– Textbooks purchased
– Engagement in subject-related forums
– Sentiment of social media posts
– Etc.
http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do
Some of the techniques data scientists use
• Classification
• Clustering
• Association rules
• Decision trees
• Regression
• Genetic algorithms
• Neural networks and support vector machines
• Machine learning
• Natural language processing
• Sentiment analysis
• Artificial intelligence
• Time series analysis
• Simulations
• Social network analysis
Technologies for data analysis: usage rates
King, J., & Magoulas, R. (2013). Data Science Salary Survey. O'Reilly Media.
The R and Python programming languages come above Excel; enterprise products are at the bottom of the heap.
Data scientist as "bricoleur"
"In the practical arts and the fine arts, bricolage (French for "tinkering") is the construction or creation of a work from a diverse range of things that happen to be available, or a work created by such a process." - Wikipedia
The R environment
What is R?
R is an open source computer language used for data manipulation, statistics, and graphics.
History of R
• 1976 – Bell Labs develops S, a language for data analysis; later released commercially as S-plus
• 1990s – R written and released as open source by (R)oss Ihaka and (R)obert Gentleman
• 1997 – The Comprehensive R Archive Network (CRAN) launched
• August 2014 – the CRAN repository contains 5,789 user-contributed packages
Benefits of R
• It's free!
• Runs on multiple platforms (Windows, Unix, MacOS)
• Validation/replication of analyses (assumes commented code and documentation)
• Long-term efficiency (using the same code for multiple projects)
SPSS* vs R
SPSS
• Limited ability for the data scientist to change the environment
• Data scientist relies on algorithms developed by SPSS
• Problem-solving constrained by SPSS developers
• Must pay for using the constrained algorithms
R
• Can use functions made by a global community of statistics researchers, or create their own
• Almost unlimited in their ability to change their environment
• Can do things SPSS users cannot even dream of
• Get all this for free
*or any other proprietary closed software system
http://www.r-project.org
Install R from here
The R console
1. Type in commands, select the text, and run with (cmd + return) or the menu option: edit | execute
2. Output appears here
R integrated development environments (IDEs)
• Some free IDEs
– Revolution R Enterprise
– Architect
– R Studio
• R Studio is the most widely used R IDE
• It's simple and intuitive
• Used to build this tutorial
Revolution Analytics
http://www.revolutionanalytics.com
Architect
http://www.openanalytics.eu
R Studio
http://www.rstudio.com
Install R Studio from here
R Studio
Type code here
Results appear here
Environment and history
Packages, plots, files
The R language
Basic grammar of R
object = function(arguments)
Guess what this does
Z <- read.table("MyFile.txt")
Two ways of doing it
= is the same as <-
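A quick sketch of the two assignment forms (toy variable names, not from the tutorial files):

```r
# both forms bind the value 5 to a name
x <- 5
y = 5
# the arrow can also point right
5 -> z
```

By convention most R style guides prefer `<-` for top-level assignment, reserving `=` for function arguments.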
Getting help
help(getwd)
Reading the slides
# Comments are in blue
<- Code is in green
Output is in black
Set the working directory
# For the tutorial, load all the R program and data files provided (see Resources slide)
# into a directory of your choice
# Set your working directory to this directory, e.g., for a Mac
setwd("/Users/ … somewhere on your computer … /R_tutorial")
# and for Windows
setwd("C:/ … somewhere on your computer … /R_tutorial")
# List the files in the directory
list.files()
> list.files()
 [1] "insurance.csv"         "negative-words.txt"  "positive-words.txt"
 [4] "R_facebook.R"          "R_intro.R"           "R_regression.R"
 [7] "R_twitter_sentiment.R" "R_twitter.R"         "sentiment.R"
[10] "twitterHull.csv"
Data types and data structures
Data types: Numeric, Character, Logical
Data structures: Vectors, Lists
Multi-dimensional: Matrices, Dataframes
Vectors and data classes
Values can be combined into vectors using the c() function
# this is a comment
num.var <- c(1, 2, 3, 4) # numeric vector
char.var <- c("1", "2", "3", "4") # character vector
log.var <- c(TRUE, TRUE, FALSE, TRUE) # logical vector
Vectors have a class, which determines how functions treat them
> class(num.var)
[1] "numeric"
> class(char.var)
[1] "character"
> class(log.var)
[1] "logical"
Vectors and data classes
The mean of a numeric vector can be calculated, but not the mean of a character vector
> mean(num.var)
[1] 2.5
> mean(char.var)
[1] NA
Warning message:
In mean.default(char.var) :
  argument is not numeric or logical: returning NA
Lists
# create a list - a collection of vectors
employees <- c("John", "Sunil", "Anna")
yearsService <- c(3, 2, 6)
empDetails <- list(employees, yearsService)
class(empDetails)
empDetails
> class(empDetails)
[1] "list"
> empDetails
[[1]]
[1] "John"  "Sunil" "Anna"
[[2]]
[1] 3 2 6
Dataframes
A data.frame is a list of vectors, each of the same length
DF <- data.frame(x=1:5, y=letters[1:5], z=letters[6:10])
> DF # data.frame with 3 columns and 5 rows
  x y z
1 1 a f
2 2 b g
3 3 c h
4 4 d i
5 5 e j
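Columns and rows of a data.frame can be addressed by name or by index; a small sketch (toy data, not from the tutorial files):

```r
DF <- data.frame(x = 1:5, y = letters[1:5])
DF$x            # the x column as a vector
DF[2, "y"]      # a single element: row 2 of column y
DF[DF$x > 3, ]  # only the rows where x exceeds 3
```

The last line (logical subsetting) is the idiom used throughout R for filtering cases.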
Multiple regression in R
insurance.csv
• insurance.csv contains medical expenses for patients enrolled in a healthcare plan
• The data file contains 1,338 cases with features of the patient as well as the total medical expenses charged to the patient's healthcare plan for the calendar year
• There are no missing values (these would be shown as NA in R, indicating empty or null)
insurance.csv
Variable – Description
age – an integer indicating the age of the beneficiary
sex – either "male" or "female"
bmi – body mass index (BMI), which gives an indication of how over- or under-weight a person is. BMI is calculated as weight in kilograms divided by height in metres squared. An ideal BMI is in the range 18.5 to 24.9
children – an integer showing the number of children/dependents covered by the plan
smoker – "yes" or "no"
region – the beneficiary's place of residence, divided into four regions: "northeast", "southeast", "southwest", or "northwest"
This example is taken from Lantz (2013)
Read the data
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
head(insurance)
  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622
Working with data
• There are specific functions for reading from (and writing to) Excel, SPSS, SAS, etc.
• However, the simplest way is to export and import files in csv (comma separated values) format – the lingua franca of data
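As a minimal sketch of the csv round trip with base R (hypothetical file name and toy data, not the tutorial's insurance file):

```r
# build a small data frame, write it out as csv, and read it back
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))
write.csv(df, file = "scores.csv", row.names = FALSE)
df2 <- read.csv("scores.csv")
```

`row.names = FALSE` stops write.csv from adding a spurious first column of row numbers, which keeps the file importable by Excel, SPSS, and other tools.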
Explore the data
summary(insurance$charges)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1122    4740    9382   13270   16640   63770
table(insurance$region)
northeast northwest southeast southwest
      324       325       364       325
cor(insurance[c("age", "bmi", "children", "charges")])
               age       bmi   children    charges
age      1.0000000 0.1092719 0.04246900 0.29900819
bmi      0.1092719 1.0000000 0.01275890 0.19834097
children 0.0424690 0.0127589 1.00000000 0.06799823
charges  0.2990082 0.1983410 0.06799823 1.00000000
Visualise the data
hist(insurance$charges)
Visualise the data
pairs(insurance[c("age", "bmi", "children", "charges")])
Visualise the data - better
library(psych)
pairs.panels(insurance[c("age", "bmi", "children", "charges")], hist.col="yellow")
Installing packages
• pairs.panels is a function in the psych package, which needs to be installed:
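A minimal sketch of the install-then-load pattern (the CRAN mirror URL is an assumption; any mirror works):

```r
# install.packages() downloads a package from CRAN the first time;
# library() then loads it into each new session
if (!requireNamespace("psych", quietly = TRUE)) {
  install.packages("psych", repos = "https://cloud.r-project.org")
}
library(psych)
```

Installation is a one-off step per machine; loading with library() must be repeated in every R session that uses the package.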
Multiple regression 1
ins_model1 <- lm(charges ~ age + children + bmi, data = insurance)
Residuals:
   Min     1Q Median     3Q    Max
-13884  -6994  -5092   7125  48627

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -6916.24    1757.48  -3.935 8.74e-05 ***
age            239.99      22.29  10.767  < 2e-16 ***
children       542.86     258.24   2.102   0.0357 *
bmi            332.08      51.31   6.472 1.35e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11370 on 1334 degrees of freedom
Multiple R-squared: 0.1201, Adjusted R-squared: 0.1181
F-statistic: 60.69 on 3 and 1334 DF, p-value: < 2.2e-16
11.8% of the variation in insurance charges is explained by the model
SPSS
R tutorial
Multiple regression 2
ins_model2 <- lm(charges ~ age + children + bmi + sex + smoker + region, data = insurance)
Residuals:
     Min       1Q   Median       3Q      Max
-11304.9  -2848.1   -982.1   1393.9  29992.8

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
age                256.9       11.9  21.587  < 2e-16 ***
children           475.5      137.8   3.451 0.000577 ***
bmi                339.2       28.6  11.860  < 2e-16 ***
sexmale           -131.3      332.9  -0.394 0.693348
smokeryes        23848.5      413.1  57.723  < 2e-16 ***
regionnorthwest   -353.0      476.3  -0.741 0.458769
regionsoutheast  -1035.0      478.7  -2.162 0.030782 *
regionsouthwest   -960.0      477.9  -2.009 0.044765 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
74.9% of the variation in insurance charges is explained by the model
Dummy variables
• R coded the character vectors as factors and automatically analysed them as dummy variables
– stringsAsFactors = TRUE
• In SPSS these need to be coded by hand
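How R expands a factor into dummy columns can be seen with model.matrix() (a toy vector here, not the insurance file):

```r
# one observation per region, just to show the coding scheme
region <- factor(c("northeast", "northwest", "southeast", "southwest"))
m <- model.matrix(~ region)
m
# the first level (northeast) becomes the reference category;
# the other three levels each get their own 0/1 column
```

This is exactly the number-of-groups-minus-one coding that the SPSS recode slides below construct by hand.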
SPSS - create dummy variables
• Recode sex
– 0 = female
– 1 = male
• Recode smoker
– 1 = yes
– 0 = no
• Recode region
– Number of dummies = number of groups – 1 = 4 – 1 = 3
• northeast = 0, 0, 0
• northwest = 1, 0, 0
• southeast = 0, 1, 0
• southwest = 0, 0, 1
Recode
Recode
Paste the syntax
Data set with all dummy coded variables created*
*after quite a bit of work!
R tutorial
Mining Twitter with R
twitteR
• Install the twitteR package to access the Twitter API
• Before you can access Twitter from R you have to:
– Sign up for a Twitter developer account
– Create a Twitter app and copy the authentication details
Twitter authentication
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
For details of how to authenticate and set up twitteR:
http://thinktostart.wordpress.com/2013/05/22/twitter-authentification-with-r/
Twitter authentication
#############################################
# 1 - Authenticate with the Twitter API
#############################################
library(twitteR)
library(ROAuth)
# the four values below come from your Twitter application settings
api_key <- "xxxxxxxxxxxxxxxxxxYsnu5NM"
api_secret <- "wm5kU4xxxxxxxxxxxxxxxxxxxxxxxxxxxxQmMyzuBRbATklN05"
access_token <- "581xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxFJwaiccnAAzScISQlp4o"
access_token_secret <- "tqHnnDDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxnhscT4sTp"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
Retrieve Twitter account details
############################################################
# 2 - Get details of the Hull Uni Twitter account
############################################################
twitacc <- getUser('UniOfHull')
twitacc$getDescription()
twitacc$getFollowersCount()
friends = twitacc$getFriends(10) # limit the number of friends returned to 10
friends
> twitacc$getDescription()
[1] "Our official Twitter feed featuring the latest news and events from the University of Hull"
> twitacc$getFollowersCount()
[1] 20025
> friends = twitacc$getFriends(10)
> friends
$`2644622095`
[1] "HullNursing"
$`1536529304`
[1] "HullSimulation"
Trace the network
net <- getUser(friends[8]) # Sheridansmith1 is friend no. 8
net$getDescription()
net$getFollowersCount()
net$getFriends(n=10)
> net$getDescription()
[1] "Sister of @damiandsmith of that there band @_TheTorn :) x"
> net$getFollowersCount()
[1] 502167
> net$getFriends(n=10)
$`88598283`
[1] "overnightstv"
$`423373477`
[1] "IsleLoseIt"
$`19650489`
[1] "donneriron"
R tutorial
Get tweets and write to file
############################################################
# 3 - Search Twitter
############################################################
Twitter.list <- searchTwitter('@UniOfHull', n=500) # limited to 500 tweets for demo
Twitter.df = twListToDF(Twitter.list)
write.csv(Twitter.df, file='twitter.csv', row.names=F)
Tweet data read into Excel
Tweet text: @UniOfHull
Analyse the tweets: the tm package
# read the twitter data into a data frame (use the previously stored .csv to
# replicate the twitter analysis presented here)
tweet_raw <- read.csv("twitterHull.csv", stringsAsFactors = FALSE)
# remove the non-text characters
tweet_raw$text <- gsub("[^[:alnum:]///' ]", "", tweet_raw$text)
# build a corpus, which is a collection of text documents
# VectorSource specifies that the source is a character vector
library(tm)
myCorpus <- Corpus(VectorSource(tweet_raw$text))
Inspect the corpus
# examine the tweet corpus
inspect(myCorpus[1:3])
[[1]]
Have you been to the UniOfHull popup campus in Leeds yet It's at the White Cloth Gallery Aire Street for all your Clearing2014 queries
[[2]]
RT UniOfHull In Newcastle and looking for Clearing2014 advice Head to the UniOfHull popup campus at Newcastle Arts Centre on Westgate
[[3]]
RT Bethanyn96 Got into uniofhull so happy
Clean corpus and create a wordcloud
# clean up the corpus using tm_map()
corpus_clean <- tm_map(myCorpus, tolower)
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
# remove any words not wanted in the wordcloud
corpus_clean <- tm_map(corpus_clean, removeWords, "uniofhull")
# create a wordcloud
library(wordcloud)
wordcloud(corpus_clean, min.freq = 10, random.order = FALSE, colors=brewer.pal(8, "Dark2"))
August 2014
Sentiment analysis
• What is the sentiment of the tweets?
• How does the number of positive words compare with the number of negative words?
• The number of positive words minus the number of negative words gives a rough indication of the "sentiment" of the tweet
• The positive and negative words are taken from a word list developed by Hu and Liu:
– http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
Sentiment analysis
# load libraries
library(plyr)
library(ggplot2)
# load the score.sentiment() function - see Appendix A for code
source('sentiment.R')
# read the tweets saved previously
hull.tweets <- read.csv("twitterHull.csv", stringsAsFactors = FALSE)
# read the lists of pos and neg words from Hu & Liu
hu.liu.pos = scan('positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('negative-words.txt', what='character', comment.char=';')
pos.words = hu.liu.pos
neg.words = hu.liu.neg
Sentiment analysis
# extract the text of the tweets and pass it to the sentiment function for
# scoring (see Appendix A for R code)
hull.text = hull.tweets$text
hull.scores = score.sentiment(hull.text, pos.words, neg.words, .progress='text')
# make a histogram of the scores
ggplot(hull.scores, aes(x=score)) +
  geom_histogram(binwidth=1, colour="black", fill="lightblue") +
  xlab("Sentiment score") +
  ylab("Frequency") +
  ggtitle("TWITTER: Hull University Sentiment Analysis")
Number of positive word matches minus the number of negative word matches
Writing the data out for text analysis
• Many text analysis packages require each comment to be in a separate file
• If the data is in Excel or SPSS it will be cumbersome to generate the files manually
• Write R code instead
Create an output directory
# create an output directory for the txt files if it does not exist
mainDir <- getwd()
subDir <- "outputText"
if (file.exists(subDir)) {
  setwd(file.path(mainDir, subDir))
} else {
  dir.create(file.path(mainDir, subDir))
  setwd(file.path(mainDir, subDir))
}
Loop to write the files
# find out how many rows are in the data.frame
tweets = nrow(Twitter.df)
# loop to write the txt files
for (tweet in 1:tweets) {
  tweetText = Twitter.df[tweet, 1]
  filename = paste("output", tweet, ".txt", sep = "")
  writeLines(tweetText, con = filename)
}
Note that R has iterators that often remove the need to write loops; see the "apply" family of functions.
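As a sketch of that apply-family alternative, the same files can be written without an explicit for loop (a toy character vector stands in for the Twitter data here):

```r
tweets <- c("first tweet", "second tweet")
# write each element to its own numbered file; sapply iterates
# over the indices, and invisible() suppresses the return value
invisible(sapply(seq_along(tweets), function(i) {
  writeLines(tweets[i], con = paste0("output", i, ".txt"))
}))
```

The body is identical to the loop's; sapply just handles the iteration and collects any results.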
R tutorial
Accessing Facebook with R
Rfacebook
## loading libraries
library(Rfacebook)
library(igraph)
# get your token from 'https://developers.facebook.com/tools/explorer/'
# make sure you give permissions to access your list of friends
# set your FB token
token <- "xxxxxxxxxxxxxxCAACEdEose0cBALkyqIxxxxxxxxxxxxxxx"
For details of how to authenticate and use Rfacebook:
http://pablobarbera.com/blog/archives/3.html
Get the access token here:
https://developers.facebook.com/tools/explorer
Get the data and graph the network
# download the adjacency matrix for the network of Facebook friends
my_network <- getNetwork(token, format="adj.matrix")
# friends who are friends with me alone
singletons <- rowSums(my_network)==0
# graph the network
my_graph <- graph.adjacency(my_network[!singletons, !singletons])
layout <- layout.drl(my_graph, options=list(simmer.attraction=0))
plot(my_graph, vertex.size=2, #vertex.label=NA,
     vertex.label.cex=0.5, edge.arrow.size=0,
     edge.curved=TRUE, layout=layout)
Facebook network graph
Ok, it's ugly – there are plenty more social network analysis and graphing packages in R to try, or you can write a bit of code to export the adjacency matrix to another package, e.g., UCINET/Netdraw, Pajek, Gephi
write.graph(graph = my_graph, file = '‹.gml', format = 'gml')
Visualization in Gephi
To find out what functions a package supports, what they do, and how to call them, see the documentation
Beg, borrow, steal code!
• Don't bother writing code from scratch
• If you want to know how to do something then Google it
• There will likely be a solution that you can scrape off the screen and modify for your own purposes
• For example
– How would you remove rows from a dataframe with missing values (NA)?
– Try "r how to remove missing values" in Google
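That search typically leads to complete.cases() and na.omit() in base R; a minimal sketch with toy data:

```r
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(df)  # TRUE only for rows with no NA anywhere
na.omit(df)         # drops every row that contains an NA
```

complete.cases() gives a logical vector for selective filtering (e.g., `df[complete.cases(df), ]`), while na.omit() does the whole job in one call.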
What’s 
next? 
• Access 
data 
held 
in 
SQL 
databases 
and 
store 
your 
own 
data, 
e.g., 
MySQL, 
and 
the 
packages 
• Read, 
write, 
manipulate 
Excel 
spreadsheets 
using 
xls 
and 
XLConnect 
• Access 
maps, 
e.g., 
Google 
Maps, 
and 
overlay 
locaDon 
data 
(e.g., 
Tweets) 
on 
a 
map 
• Screen 
scrape 
Web 
sites 
that 
don’t 
have 
APIs 
(e.g., 
Google 
Scholar)
Suggested further reading and resources
• Lantz, B. (2013). Machine Learning with R. Packt Publishing. (highly recommended)
• Miller, T. (2014). Modeling Techniques in Predictive Analytics: Business Problems and Solutions. Pearson Education.
• R Reference Card 2.0
– http://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf
• R Reference Card for Data Mining
– http://cran.r-project.org/doc/contrib/YanchangZhao-refcard-data-mining.pdf
• R-bloggers for news and tutorials
– http://www.r-bloggers.com
Appendices
Appendix A
The score.sentiment() function
#'
#' score.sentiment() implements a very simple algorithm to estimate
#' sentiment, assigning an integer score by subtracting the number
#' of occurrences of negative words from that of positive words.
#'
#' @param sentences vector of text to score
#' @param pos.words vector of words of positive sentiment
#' @param neg.words vector of words of negative sentiment
#' @param .progress passed to <code>laply()</code> to control display of the progress bar
#' @returnType data.frame
#' @return data.frame of text and corresponding sentiment scores
#' @author Jeffrey Breen jbreen@cambridge.aero
https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
  # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # remove the non-text characters
    sentence <- gsub("[^[:alnum:]///' ]", "", sentence)
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

apidays
 
Merits and Demerits of DBMS over File System & 3-Tier Architecture in DBMS
MD RIZWAN MOLLA
 
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
 
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
 
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
 
apidays Munich 2025 - Building Telco-Aware Apps with Open Gateway APIs, Subhr...
apidays
 

R tutorial

  • 6. Better decisions - predictive analytics • A predictive model that calculates strawberry purchases based on: – Weather forecast – Store temperature – Freezer sensor data – Remaining stock per shelf life – Sales transaction point of sale feeds – Web searches, social mentions http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do
  • 7. Predictive analytics • For example, what data might help us predict which students will drop out? – Assessment grades at University – Prior education attainment – Social background – Distance of home from University – Friendship circles and networks (e.g., sports club memberships) – Attendance at lectures and tutorials – Interaction in lectures and tutorials – Time spent on campus – Time spent in library – Number of accesses to electronic learning resources – Text books purchased – Engagement in subject-related forums – Sentiment of social media posts – Etc.
  • 9. Some of the techniques data scientists use • Classification • Clustering • Association rules • Decision trees • Regression • Genetic algorithms • Neural networks and support vector machines • Machine learning • Natural language processing • Sentiment analysis • Artificial intelligence • Time series analysis • Simulations • Social network analysis
  • 10. Technologies for data analysis: usage rates King, J., & R. Magoulas (2013). Data Science Salary Survey. O'Reilly Media. The R and Python programming languages come above Excel; enterprise products are bottom of the heap
  • 11. Data scientist as "bricoleur" "In the practical arts and the fine arts, bricolage (French for "tinkering") is the construction or creation of a work from a diverse range of things that happen to be available, or a work created by such a process." Wikipedia
  • 13. What is R? R is an open source computer language used for data manipulation, statistics, and graphics.
  • 14. History of R • 1976 – Bell Labs develops S, a language for data analysis; released commercially as S-plus • 1990s – R written and released as open source by (R)oss Ihaka and (R)obert Gentleman • 1997 – The Comprehensive R Archive Network (CRAN) launched • August 2014 – CRAN repository contains 5789 user-contributed packages
  • 15. Benefits of R • It's free! • Runs on multiple platforms (Windows, Unix, MacOS) • Validation/replication of analyses (assumes commented code and documentation) • Long-term efficiency (using the same code for multiple projects)
  • 16. SPSS* vs R SPSS • Limited ability for the data scientist to change the environment • Data scientist relies on algorithms developed by SPSS • Problem-solving constrained by SPSS developers • Must pay for using the constrained algorithms R • Can use functions made by a global community of statistics researchers or create their own • Almost unlimited in their ability to change their environment • Can do things SPSS users cannot even dream of • Get all this for free *or any other proprietary closed software system
  • 19. 1. Type in commands, select the text and run with (cmd + return) or the menu option: edit | execute 2. Output appears here
  • 20. R integrated development environments (IDEs) • Some free IDEs – Revolution R Enterprise – Architect – R Studio • Most widely used R IDE • It's simple and intuitive • Used to build this tutorial
  • 23. R Studio http://www.rstudio.com Install R Studio from here
  • 24. R Studio Type code here Results appear here Environment and history Packages, plots, files
  • 26. Basic grammar of R object = function(arguments)
  • 27. Guess what this does Z <- read.table("MyFile.txt")
  • 28. Two ways of doing it = is the same as <-
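A quick sketch of the equivalence (variable names are illustrative only):

```r
# '<-' and '=' both assign at the top level
x <- c(2, 4, 6)
y = c(2, 4, 6)
identical(x, y)  # TRUE
```

Note that inside a function call `=` binds arguments rather than assigning, so `<-` is the conventional choice in R code.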
  • 30. Reading the slides # Comments are in blue <- Code is in green Output is in black
  • 31. Set the working directory # For the tutorial, load all the R program and data files provided (see Resources slide) # into a directory of your choice # Set your working directory to this directory, e.g., for a Mac setwd("/Users/ … somewhere on your computer … /R_tutorial") # and for Windows setwd("C:/ … somewhere on your computer … /R_tutorial") # List the files in the directory list.files() > list.files() [1] "insurance.csv" "negative-words.txt" "positive-words.txt" [4] "R_facebook.R" "R_intro.R" "R_regression.R" [7] "R_twitter_sentiment.R" "R_twitter.R" "sentiment.R" [10] "twitterHull.csv"
  • 32. Data types and data structures Data types: Numeric, Character, Logical Data structures: Vectors, Lists Multi-dimensional: Matrices, Dataframes
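The structures listed above can be sketched in a few lines (all values are illustrative):

```r
num.vec <- c(1, 2, 3)                    # vector: elements of one type
mixed <- list(name = "Anna", years = 6)  # list: elements of mixed types
m <- matrix(1:6, nrow = 2)               # matrix: 2 x 3, one type throughout
df <- data.frame(id = 1:3, flag = c(TRUE, FALSE, TRUE))  # dataframe: columns of differing types
class(df)  # "data.frame"
```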
  • 33. Vectors and data classes Values can be combined into vectors using the c() function # this is a comment num.var <- c(1, 2, 3, 4) # numeric vector char.var <- c("1", "2", "3", "4") # character vector log.var <- c(TRUE, TRUE, FALSE, TRUE) # logical vector Vectors have a class which determines how functions treat them > class(num.var) [1] "numeric" > class(char.var) [1] "character" > class(log.var) [1] "logical"
  • 34. Vectors and data classes Can calculate the mean of a numeric vector, but not of a character vector > mean(num.var) [1] 2.5 > mean(char.var) [1] NA Warning message: In mean.default(char.var) : argument is not numeric or logical: returning NA
  • 35. Lists # create a list - a collection of vectors employees <- c("John", "Sunil", "Anna") yearsService <- c(3, 2, 6) empDetails <- list(employees, yearsService) class(empDetails) empDetails > class(empDetails) [1] "list" > empDetails [[1]] [1] "John" "Sunil" "Anna" [[2]] [1] 3 2 6
  • 36. Dataframes A data.frame is a list of vectors, each of the same length DF <- data.frame(x=1:5, y=letters[1:5], z=letters[6:10]) > DF # data.frame with 3 columns and 5 rows x y z 1 1 a f 2 2 b g 3 3 c h 4 4 d i 5 5 e j
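Building on the DF example above, a sketch of common ways to pull data back out of a data.frame:

```r
DF <- data.frame(x = 1:5, y = letters[1:5], z = letters[6:10])
DF$x            # a single column as a vector
DF[2, "y"]      # one cell: row 2, column y -> "b"
DF[DF$x > 3, ]  # only the rows where x is greater than 3
```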
  • 38. insurance.csv • insurance.csv contains medical expenses for patients enrolled in a healthcare plan • The data file contains 1,338 cases with features of the patient as well as the total medical expenses charged to the patient's healthcare plan for the calendar year • There are no missing values (these would be shown as NA in R, indicating empty or null)
  • 39. insurance.csv Variable Description age an integer indicating the age of the beneficiary sex either "male" or "female" bmi body mass index (BMI), which gives an indication of how over- or under-weight a person is. BMI is calculated as weight in kilograms divided by height in metres squared. An ideal BMI is in the range 18.5 to 24.9 children an integer showing the number of children/dependents covered by the plan smoker "yes" or "no" region the beneficiary's place of residence, divided into four regions: "northeast", "southeast", "southwest", or "northwest". This example is taken from Lantz (2013)
  • 40. Read the data insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE) head(insurance) age sex bmi children smoker region charges 1 19 female 27.900 0 yes southwest 16884.924 2 18 male 33.770 1 no southeast 1725.552 3 28 male 33.000 3 no southeast 4449.462 4 33 male 22.705 0 no northwest 21984.471 5 32 male 28.880 0 no northwest 3866.855 6 31 female 25.740 0 no southeast 3756.622
  • 41. Working with data • There are specific functions for reading from (and writing to) Excel, SPSS, SAS, etc. • However, the simplest way is to export and import files in csv (comma separated values) format – the lingua franca of data
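A round-trip sketch of the csv workflow described above (the data and file name are made up for the example):

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))
csvfile <- file.path(tempdir(), "scores.csv")  # hypothetical scratch file
write.csv(df, csvfile, row.names = FALSE)      # export to csv
df2 <- read.csv(csvfile)                       # import it back
identical(df$score, df2$score)  # TRUE
```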
  • 42. Explore the data summary(insurance$charges) Min. 1st Qu. Median Mean 3rd Qu. Max. 1122 4740 9382 13270 16640 63770 table(insurance$region) northeast northwest southeast southwest 324 325 364 325 cor(insurance[c("age", "bmi", "children", "charges")]) age bmi children charges age 1.0000000 0.1092719 0.04246900 0.29900819 bmi 0.1092719 1.0000000 0.01275890 0.19834097 children 0.0424690 0.0127589 1.00000000 0.06799823 charges 0.2990082 0.1983410 0.06799823 1.00000000
  • 43. Visualise the data hist(insurance$charges)
  • 44. Visualise the data pairs(insurance[c("age", "bmi", "children", "charges")])
  • 45. Visualise the data - better library(psych) pairs.panels(insurance[c("age", "bmi", "children", "charges")], hist.col="yellow")
  • 46. Installing packages • pairs.panels is a function in the psych package, which needs to be installed:
  • 47. Multiple regression 1 ins_model1 <- lm(charges ~ age + children + bmi, data = insurance) 11.8% of variation in insurance charges is explained by the model Residuals: Min 1Q Median 3Q Max -13884 -6994 -5092 7125 48627 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -6916.24 1757.48 -3.935 8.74e-05 *** age 239.99 22.29 10.767 < 2e-16 *** children 542.86 258.24 2.102 0.0357 * bmi 332.08 51.31 6.472 1.35e-10 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 11370 on 1334 degrees of freedom Multiple R-squared: 0.1201, Adjusted R-squared: 0.1181 F-statistic: 60.69 on 3 and 1334 DF, p-value: < 2.2e-16
  • 48. SPSS
  • 50. Multiple regression 2 ins_model2 <- lm(charges ~ age + children + bmi + sex + smoker + region, data = insurance) 74.9% of variation in insurance charges is explained by the model Residuals: Min 1Q Median 3Q Max -11304.9 -2848.1 -982.1 1393.9 29992.8 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -11938.5 987.8 -12.086 < 2e-16 *** age 256.9 11.9 21.587 < 2e-16 *** children 475.5 137.8 3.451 0.000577 *** bmi 339.2 28.6 11.860 < 2e-16 *** sexmale -131.3 332.9 -0.394 0.693348 smokeryes 23848.5 413.1 57.723 < 2e-16 *** regionnorthwest -353.0 476.3 -0.741 0.458769 regionsoutheast -1035.0 478.7 -2.162 0.030782 * regionsouthwest -960.0 477.9 -2.009 0.044765 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 6062 on 1329 degrees of freedom Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494 F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
  • 51. Dummy variables • R codes character vectors as factors and automatically analyses them as dummy variables – stringsAsFactors = TRUE • In SPSS these need to be coded by hand
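The automatic dummy coding can be inspected with base R's model.matrix(); a small sketch with made-up data:

```r
region <- factor(c("northeast", "northwest", "southeast", "southwest"))
# one intercept plus k-1 dummy columns; "northeast" (first level) is the baseline
mm <- model.matrix(~ region)
colnames(mm)  # "(Intercept)" "regionnorthwest" "regionsoutheast" "regionsouthwest"
```

These column names match the coefficient labels (regionnorthwest, regionsoutheast, regionsouthwest) in the ins_model2 output on the previous slide.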
  • 52. SPSS - create dummy variables • Recode sex • 0 = female • 1 = male • Recode smoker – 1 = yes – 0 = no • Recode region – Number of dummies = number of groups – 1 – = 4 – 1 = 3 • northeast = 0, 0, 0 • northwest = 1, 0, 0 • southeast = 0, 1, 0 • southwest = 0, 0, 1
  • 56. Data set with all dummy coded variables created* *after quite a bit of work!
  • 59. twitteR • Install the twitteR package to access the Twitter API • Before you can access Twitter from R you have to: – Sign up for a Twitter developer account – Create a Twitter app and copy the authentication details
  • 60. Twitter authentication xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx For details of how to authenticate and set up twitteR: http://thinktostart.wordpress.com/2013/05/22/twitter-authentification-with-r/
  • 61. Twitter authentication ############################################# # 1 - Authenticate with Twitter API ############################################# library(twitteR) library(ROAuth) api_key <- "xxxxxxxxxxxxxxxxxxYsnu5NM" api_secret <- "wm5kU4xxxxxxxxxxxxxxxxxxxxxxxxxxxxQmMyzuBRbATklN05" access_token <- "581xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxFJwaiccnAAzScISQlp4o" access_token_secret <- "tqHnnDDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxnhscT4sTp" From your Twitter application settings setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
  • 62. Retrieve Twitter account details ############################################################ # 2 - Get details of Hull Uni Twitter account ############################################################ twitacc <- getUser('UniOfHull') twitacc$getDescription() twitacc$getFollowersCount() friends = twitacc$getFriends(10) # limit the number of friends returned to 10 friends > twitacc$getDescription() [1] "Our official Twitter feed featuring the latest news and events from the University of Hull" > twitacc$getFollowersCount() [1] 20025 > friends = twitacc$getFriends(10) > friends $`2644622095` [1] "HullNursing" $`1536529304` [1] "HullSimulation"
  • 63. Trace the network net <- getUser(friends[8]) # Sheridansmith1 is friend no. 8 net$getDescription() net$getFollowersCount() net$getFriends(n=10) > net$getDescription() [1] "Sister of @damiandsmith of that there band @_TheTorn :) x" > net$getFollowersCount() [1] 502167 > net$getFriends(n=10) $`88598283` [1] "overnightstv" $`423373477` [1] "IsleLoseIt" $`19650489` [1] "donneriron"
  • 65. Get tweets and write to file ############################################################ # 3 - Search Twitter ############################################################ Twitter.list <- searchTwitter('@UniOfHull', n=500) # limited to 500 tweets for demo Twitter.df = twListToDF(Twitter.list) write.csv(Twitter.df, file='twitter.csv', row.names=F)
  • 66. Tweet data read into Excel
  • 68. Analyse the tweets: the tm package # read the twitter data into a data frame (use previously stored .csv to # replicate the twitter analysis presented here) tweet_raw <- read.csv("twitterHull.csv", stringsAsFactors = FALSE) # remove the non-text characters tweet_raw$text <- gsub("[^[:alnum:]///' ]", "", tweet_raw$text) # build a corpus, which is a collection of text documents # VectorSource specifies that the source is a character vector library(tm) myCorpus <- Corpus(VectorSource(tweet_raw$text))
  • 69. Inspect the corpus # examine the tweet corpus inspect(myCorpus[1:3]) [[1]] Have you been to the UniOfHull popup campus in Leeds yet It's at the White Cloth Gallery Aire Street for all your Clearing2014 queries [[2]] RT UniOfHull In Newcastle and looking for Clearing2014 advice Head to the UniOfHull popup campus at Newcastle Arts Centre on Westgate [[3]] RT Bethanyn96 Got into uniofhull so happy
  • 70. Clean corpus and create a wordcloud # clean up the corpus using tm_map() corpus_clean <- tm_map(myCorpus, tolower) corpus_clean <- tm_map(corpus_clean, PlainTextDocument) corpus_clean <- tm_map(corpus_clean, removeNumbers) corpus_clean <- tm_map(corpus_clean, removeWords, stopwords()) corpus_clean <- tm_map(corpus_clean, removePunctuation) corpus_clean <- tm_map(corpus_clean, stripWhitespace) # remove any words not wanted in the wordcloud corpus_clean <- tm_map(corpus_clean, removeWords, "uniofhull") # create a wordcloud library(wordcloud) wordcloud(corpus_clean, min.freq = 10, random.order = FALSE, colors=brewer.pal(8, "Dark2"))
  • 72. Sentiment analysis • What is the sentiment of the tweets? • How does the number of positive words compare with the number of negative words? • The number of positive words minus the number of negative words gives a rough indication of the "sentiment" of the tweet • The positive and negative words are taken from a word list developed by Hu and Liu: – http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
  • 73. Sentiment analysis # load libraries library(plyr) library(ggplot2) # load the score.sentiment() function – see Appendix A for code source('sentiment.R') # read the tweets saved previously hull.tweets <- read.csv("twitterHull.csv", stringsAsFactors = FALSE) # read the lists of pos and neg words from Hu & Liu hu.liu.pos = scan('positive-words.txt', what='character', comment.char=';') hu.liu.neg = scan('negative-words.txt', what='character', comment.char=';')
  • 74. Sentiment analysis # extract the text of the tweets and pass to the sentiment function for # scoring (see Appendix A for R code) hull.text = hull.tweets$text hull.scores = score.sentiment(hull.text, hu.liu.pos, hu.liu.neg, .progress='text') # make a histogram of the scores ggplot(hull.scores, aes(x=score)) + geom_histogram(binwidth=1, colour="black", fill="lightblue") + xlab("Sentiment score") + ylab("Frequency") + ggtitle("TWITTER: Hull University Sentiment Analysis")
  • 75. Number of positive word matches minus the number of negative word matches
  • 76. Writing the data out for text analysis • Many text analysis packages require each comment to be in a separate file • If the data is in Excel or SPSS it will be cumbersome to generate the files manually • Write R code instead
  • 77. Create an output directory # create an output directory for the txt files if it does not exist mainDir <- getwd() subDir <- "outputText" if (file.exists(subDir)){ setwd(file.path(mainDir, subDir)) } else { dir.create(file.path(mainDir, subDir)) setwd(file.path(mainDir, subDir)) }
  • 78. Loop to write the files # find out how many rows in the data.frame tweets = nrow(Twitter.df) # loop to write the txt files for (tweet in 1:tweets) { tweetText = Twitter.df[tweet, 1] filename = paste("output", tweet, ".txt", sep = "") writeLines(tweetText, con = filename) } Note that R has iterators that often remove the need to write loops; see the "apply" family of functions
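As a sketch of the "apply" alternative mentioned on the slide, the file names from the loop can be built without explicit iteration (the count of 3 is arbitrary for the example):

```r
# sapply applies a function over 1..n and simplifies the result to a vector
filenames <- sapply(1:3, function(i) paste0("output", i, ".txt"))
filenames  # "output1.txt" "output2.txt" "output3.txt"
```

In fact paste0() is already vectorised, so paste0("output", 1:3, ".txt") gives the same result in one call.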
  • 81. Rfacebook ## loading libraries library(Rfacebook) library(igraph) # get your token from 'https://developers.facebook.com/tools/explorer/' # make sure you give permissions to access your list of friends # set your FB token token <- "xxxxxxxxxxxxxxCAACEdEose0cBALkyqIxxxxxxxxxxxxxxx" For details of how to authenticate and use Rfacebook: http://pablobarbera.com/blog/archives/3.html Get the access token here: https://developers.facebook.com/tools/explorer
  • 82. Get the data and graph the network # download adjacency matrix for network of Facebook friends my_network <- getNetwork(token, format="adj.matrix") # friends who are friends with me alone singletons <- rowSums(my_network)==0 # graph the network my_graph <- graph.adjacency(my_network[!singletons, !singletons]) layout <- layout.drl(my_graph, options=list(simmer.attraction=0)) plot(my_graph, vertex.size=2, #vertex.label=NA, vertex.label.cex=0.5, edge.arrow.size=0, edge.curved=TRUE, layout=layout)
  • 83. Facebook network graph Ok, it's ugly – there are plenty more social network analysis and graphing packages in R to try, or you can write a bit of code to export the adjacency matrix to another package, e.g., UCINET/Netdraw, Pajek, Gephi
  • 84. write.graph(graph = my_graph, file = '‹.gml', format = 'gml') Visualization in Gephi
  • 85. To find out what functions a package supports, what they do and how to call them, see the documentation
  • 86. Beg, borrow, steal code! • Don't bother writing code from scratch • If you want to know how to do something then Google it • There will likely be a solution that you can scrape off the screen and modify for your own purposes • For example – How would you remove rows from a dataframe with missing values (NA)? – Try "r how to remove missing values" in Google
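For the example question on the slide, the usual answers that search turns up are base R's na.omit() and complete.cases(); a sketch with a toy data frame:

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
na.omit(df)               # keeps only row 1, the row with no NAs
df[complete.cases(df), ]  # equivalent row filter
```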
  • 87. What's next? • Access data held in SQL databases and store your own data, e.g., MySQL, and the packages • Read, write, manipulate Excel spreadsheets using the xlsx and XLConnect packages • Access maps, e.g., Google Maps, and overlay location data (e.g., Tweets) on a map • Screen scrape Web sites that don't have APIs (e.g., Google Scholar)
  • 88. Suggested further reading and resources • Lantz, B., (2013). Machine Learning with R. Packt Publishing. (highly recommended) • Miller, T., (2014). Modeling Techniques in Predictive Analytics: Business Problems and Solutions. Pearson Education. • R Reference Card 2.0 – http://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf • R Reference Card for Data Mining – http://cran.r-project.org/doc/contrib/YanchangZhao-refcard-data-mining.pdf • R-bloggers for news and tutorials – http://www.r-bloggers.com
  • 90. Appendix A The score.sentiment() function #' #' score.sentiment() implements a very simple algorithm to estimate #' sentiment, assigning an integer score by subtracting the number #' of occurrences of negative words from that of positive words. #' #' @param sentences vector of text to score #' @param pos.words vector of words of positive sentiment #' @param neg.words vector of words of negative sentiment #' @param .progress passed to <code>laply()</code> to control the progress bar. #' @returnType data.frame #' @return data.frame of text and corresponding sentiment scores #' @author Jeffrey Breen [email protected] https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107
  • 91. score.sentiment = function(sentences, pos.words, neg.words, .progress='none') { require(plyr) require(stringr) # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply: scores = laply(sentences, function(sentence, pos.words, neg.words) { # remove the non-text characters sentence <- gsub("[^[:alnum:]///' ]", "", sentence) # clean up sentences with R's regex-driven global substitute, gsub(): sentence = gsub('[[:punct:]]', '', sentence) sentence = gsub('[[:cntrl:]]', '', sentence) sentence = gsub('\\d+', '', sentence) # and convert to lower case: sentence = tolower(sentence)
  • 92. # split into words. str_split is in the stringr package word.list = str_split(sentence, '\\s+') # sometimes a list() is one level of hierarchy too much words = unlist(word.list) # compare our words to the dictionaries of positive & negative terms pos.matches = match(words, pos.words) neg.matches = match(words, neg.words) # match() returns the position of the matched term or NA # we just want a TRUE/FALSE: pos.matches = !is.na(pos.matches) neg.matches = !is.na(neg.matches) # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum(): score = sum(pos.matches) - sum(neg.matches) return(score) }, pos.words, neg.words, .progress=.progress ) scores.df = data.frame(score=scores, text=sentences) return(scores.df) }