SHARETHIS
DATA ANALYSIS with R
Hassan Namarvar
2
WHAT IS R?
• R is a free programming language and software environment
for statistical computing and graphics.
• It is similar to the S language developed at AT&T Bell Labs by Rick
Becker, John Chambers and Allan Wilks.
• R was initially developed by Ross Ihaka and Robert Gentleman
(1996) at the University of Auckland, New Zealand.
• R source code is written in C, Fortran, and R.
3
R PARADIGMS
Multi-paradigm:
– Array
– Object-oriented
– Imperative
– Functional
– Procedural
– Reflective
4
STATISTICAL FEATURES
• Graphical Techniques
• Linear and nonlinear modeling
• Classical statistical tests
• Time-series analysis
• Classification
• Clustering
• Machine learning
5
PROGRAMMING FEATURES
• R is an interpreted language
• Access R through a command-line interpreter
• Like MATLAB, R supports matrix arithmetic
• Data structures:
– Vectors
– Matrices
– Arrays
– Data Frames
– Lists
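A minimal sketch of the five data structures listed above. All variable names and values here are illustrative and do not come from the slides.

```r
# Vector: ordered values of a single type
v <- c(4, 7, 23.5)

# Matrix: two-dimensional, single type; filled column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: generalizes the matrix to any number of dimensions
a <- array(1:24, dim = c(2, 3, 4))

# Data frame: a table whose columns may have different types
df <- data.frame(id = 1:3, name = c("aa", "bb", "cc"))

# List: ordered collection of arbitrary objects
l <- list(numbers = v, table = df, flag = TRUE)

str(l)  # compact overview of a nested structure
```

str() is a convenient first check on any unfamiliar object, since it prints the type and size of each component.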
6
ADVANTAGES OF R
• One of the most comprehensive statistical analysis packages
available.
• Outstanding graphical capabilities
• Open source software – reviewed by experts
• R is free and licensed under the GNU General Public License (GPL).
• R has over 5,578 packages as of May 31, 2014!
• R is cross-platform: GNU/Linux, Mac, and Windows.
• R plays well with CSV, SAS, SPSS, Excel, Access, Oracle, MySQL,
and SQLite.
7
HOW TO INSTALL R?
• Download and install the latest version from:
– http://cran.r-project.org
• Install packages from R Console:
– > install.packages('package_name')
• R has its own LaTeX-like documentation:
– > help()
8
STARTING WITH R
• In R console:
– > x <- 2
– > x
– > y <- x^2
– > y
– > ls()
– > rm(y)
• Vectors:
– > v <- c(4, 7, 23.5, 76.2, 80)
– > summary(v)
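Most operations on vectors are element-wise. The short sketch below reuses the vector v from the slide; the particular operations chosen are illustrative.

```r
v <- c(4, 7, 23.5, 76.2, 80)

v * 2        # arithmetic applies to every element
v[2]         # indexing starts at 1, so this is 7
v[v > 20]    # logical indexing keeps elements that satisfy the condition
mean(v)      # 38.14
length(v)    # 5
```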
9
STARTING WITH R
• Histogram:
– > r <- rnorm(100)
– > summary(r)
– > plot(r)
– > hist(r)
• QQ-Plot (Quantile):
– > qqplot(r, rnorm(1000))
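qqplot() above compares two samples with each other. As a sketch of a common alternative, qqnorm() compares one sample against the theoretical normal distribution and qqline() adds a reference line. The fixed seed is an assumption added here only to make the run reproducible.

```r
set.seed(42)               # assumption: fixed seed for reproducibility
r <- rnorm(100)

qq <- qqnorm(r)            # sample quantiles against theoretical normal quantiles
qqline(r, col = "red")     # reference line through the first and third quartiles
```

Points that stay close to the line suggest the sample is approximately normal; systematic curvature suggests skew or heavy tails.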
10
STARTING WITH R
• Factors:
– > g <- c('f', 'm', 'm', 'm', 'f', 'm', 'f', 'm')
– > h <- factor(g)
– > table(g)
• Matrices:
– > r <- rnorm(100)
– > dim(r) <- c(50,2)
– > r
– > summary(r)
– > M <- matrix(c(45, 23, 66, 77, 33, 44), 2, 3, byrow=TRUE)
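A short sketch of what can be done with the factor and the matrix once they exist. The values are the ones from the slide; the follow-up calls are illustrative.

```r
g <- c("f", "m", "m", "m", "f", "m", "f", "m")
h <- factor(g)
levels(h)      # distinct categories, sorted: "f" "m"
table(h)       # counts per level: f = 3, m = 5

M <- matrix(c(45, 23, 66, 77, 33, 44), nrow = 2, ncol = 3, byrow = TRUE)
dim(M)         # 2 rows, 3 columns
M[2, 3]        # element in row 2, column 3: 44
t(M)           # transpose, 3 x 2
M %*% t(M)     # matrix product, 2 x 2
```

Note that * multiplies matrices element by element, while %*% is the matrix product.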
11
STARTING WITH R
• Data Frames:
– > n = c(2, 3, 5)
– > s = c("aa", "bb", "cc")
– > b = c(TRUE, FALSE, TRUE)
– > df = data.frame(n, s, b)
• Built-in Data Set:
– > state.x77
– > st = as.data.frame(state.x77)
– > st$Density = st$Population * 1000 / st$Area
– > summary(st)
– > cor(st)
– > pairs(st)
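Once the data frame exists, rows can be sorted and filtered with base R. The sketch below rebuilds st as on the slide; the threshold of 12 and the selected columns are arbitrary choices for illustration.

```r
st <- as.data.frame(state.x77)
# Population is recorded in thousands and Area in square miles
st$Density <- st$Population * 1000 / st$Area

# Three densest states: sort by descending density, keep a few columns
head(st[order(-st$Density), c("Population", "Area", "Density")], 3)

# States with a murder rate above 12, showing two columns only
subset(st, Murder > 12, select = c(Murder, Illiteracy))
```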
12
STARTING WITH R
[Figure: scatterplot matrix produced by pairs(st), with panels for Population, Income, Illiteracy, Life Exp, Murder, HS Grad, Frost, Area, and Density]
13
LINEAR REGRESSION MODEL IN R
• Linear Regression Model:
– > x <- 1:100
– > y <- x^3
– Model: y = a + b * x
– > lm(y ~ x)
– > model <- lm(y ~ x)
– > summary(model)
– > par(mfrow=c(2,2))
– > plot(model)
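Because y is exactly x^3 here, a straight line is a deliberately poor model, which is what the diagnostic plots reveal. As a sketch, the fit can be compared with a regression on the cubic term; the names linear and cubic are illustrative.

```r
x <- 1:100
y <- x^3

linear <- lm(y ~ x)         # straight-line fit from the slide
cubic  <- lm(y ~ I(x^3))    # I() makes ^ mean arithmetic power inside a formula

summary(linear)$r.squared   # about 0.84
coef(cubic)                 # intercept near 0, slope near 1
```

summary() is not called on the cubic model because the fit is essentially perfect, and R warns that the summary of such a fit may be unreliable.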
14
LM MODEL
– Call:
– lm(formula = y ~ x)
– Residuals:
– Min 1Q Median 3Q Max
– -129827 -103680 -29649 85058 292030
– Coefficients:
– Estimate Std. Error t value Pr(>|t|)
– (Intercept) -207070.2 23299.3 -8.887 3.14e-14 ***
– x 9150.4 400.6 22.844 < 2e-16 ***
– ---
– Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
– Residual standard error: 115600 on 98 degrees of freedom
– Multiple R-squared: 0.8419, Adjusted R-squared: 0.8403
– F-statistic: 521.9 on 1 and 98 DF, p-value: < 2.2e-16
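The numbers in this printout can also be read programmatically rather than copied from the console. A minimal sketch, refitting the same model; the variable s is an illustrative name.

```r
x <- 1:100
y <- x^3
model <- lm(y ~ x)
s <- summary(model)

coef(model)       # (Intercept) -207070.2, x 9150.4
s$coefficients    # matrix of estimates, standard errors, t values, p-values
s$r.squared       # about 0.8419
s$sigma           # residual standard error, about 115600
```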
15
LM MODEL
[Figure: plot titled "y=x^3", with x (0 to 100) on the horizontal axis and y (0 to 1e+06) on the vertical axis]
16
DIAGNOSTIC PLOTS
[Figure: the four panels produced by plot(model): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance); observations 98, 99, and 100 are labeled in each panel]
17
LINEAR REGRESSION MODEL IN R
• Model Built-in Data:
– > colnames(st)[4] = "Life.Exp"
– > colnames(st)[6] = "HS.Grad"
– > model1 = lm(Life.Exp ~ Population + Income
+ Illiteracy + Murder + HS.Grad + Frost +
Area + Density, data=st)
– > summary(model1)
– > model2 <- step(model1)
– > model3 = update(model2, .~.-Population)
– > summary(model3)
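step() performs stepwise selection by AIC, backward by default when no scope is given. The sketch below rebuilds the data and compares the full and selected models; the formula shorthand Life.Exp ~ . and the argument trace = 0 are choices made here, not taken from the slide.

```r
st <- as.data.frame(state.x77)
st$Density <- st$Population * 1000 / st$Area
colnames(st)[4] <- "Life.Exp"
colnames(st)[6] <- "HS.Grad"

model1 <- lm(Life.Exp ~ ., data = st)   # "." means all remaining columns
model2 <- step(model1, trace = 0)       # trace = 0 suppresses the step-by-step log

formula(model2)                         # predictors that survived selection
AIC(model1, model2)                     # lower AIC indicates the preferred model
```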
18
LINEAR REGRESSION MODEL IN R
• Confidence limits on Estimated Coefficients:
– > confint(model3)
• Prediction for a new observation:
– > predict(model3, list(Murder=10.5, HS.Grad=48, Frost=100))
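predict() can also return an interval around the point prediction. The sketch below assumes the final model contains exactly the three predictors passed to predict() on the slide; fit3 and new_state are illustrative names.

```r
st <- as.data.frame(state.x77)
colnames(st)[4] <- "Life.Exp"
colnames(st)[6] <- "HS.Grad"

# Assumption: the final model uses Murder, HS.Grad, and Frost only
fit3 <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = st)

new_state <- data.frame(Murder = 10.5, HS.Grad = 48, Frost = 100)
p <- predict(fit3, new_state, interval = "prediction", level = 0.95)
p   # columns: fit, lwr, upr
```

interval = "confidence" would instead give the narrower interval for the mean response at those predictor values.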
19
OUTLIERS
• Boxplot:
– > v <- rnorm(100)
– > v = c(v,10)
– > boxplot(v)
– > rug(jitter(v), side=2)
[Figure: boxplot of v with a rug along the vertical axis, which runs from -2 to 10]
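The points a boxplot draws as outliers can be listed directly with boxplot.stats(). A minimal sketch; the fixed seed is an assumption added for reproducibility.

```r
set.seed(1)                     # assumption: fixed seed for reproducibility
v <- c(rnorm(100), 10)          # 100 standard normal draws plus one planted outlier

out <- boxplot.stats(v)$out     # points beyond 1.5 * IQR from the box
out
which(v %in% out)               # positions of the flagged values
```

The planted value 10 sits far outside the whiskers, so it is always among the flagged points; a few genuine draws may be flagged as well.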
20
PROBABILITY DENSITY FUNCTION
• PDF:
– > r <- rnorm(1000)
– > hist(r, prob=TRUE)
– > lines(density(r), col="red")
[Figure: "Histogram of r" on the density scale (x-axis: r, from -3 to 3; y-axis: Density, from 0.0 to 0.4), with the estimated density curve overlaid]
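As a sketch, the kernel density estimate can be compared with the true N(0, 1) density and checked numerically. The seed and the dashed reference curve are additions made here for illustration.

```r
set.seed(1)                             # assumption: fixed seed for reproducibility
r <- rnorm(1000)
d <- density(r)                         # kernel density estimate on an even grid

hist(r, freq = FALSE)                   # density scale, so bar areas sum to 1
lines(d, col = "red")                   # estimated density
curve(dnorm(x), add = TRUE, lty = 2)    # true N(0, 1) density for reference

# Numerical check: the estimated density integrates to roughly 1
area <- sum(d$y) * diff(d$x[1:2])
area
```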
21
CASE STUDY: SHARETHIS EXAMPLE
• Relationship between clicks, winning price, and impressions on ADX:
• Data
– Analyzed ADX Hourly Impression Logs
• Method
– Detected outliers
– Predicted clicks using a regression tree model
22
CASE STUDY: SHARETHIS EXAMPLE
• Outlier Detection:
[Figure: outlier detection plots for Clicks and Impressions]
23
CASE STUDY: SHARETHIS EXAMPLE
• Regression Tree
– One of the most powerful classification/regression methods
– > library(rpart)
– > fit <- rpart(log(CLK) ~ log(IMP) + AVG_PRICE +
SD_PRICE, data=x)
– > plot(fit)
– > text(fit)
– > plot(log(x$CLK), predict(fit))
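The hourly log data behind this slide is not available, so the sketch below fits the same formula to synthetic data. The column names follow the slide, but every value and the relationship between clicks and impressions are made up for illustration.

```r
library(rpart)   # recommended package shipped with standard R distributions

set.seed(1)      # assumption: fixed seed for reproducibility
n <- 500
# Hypothetical stand-in for the hourly impression logs
x <- data.frame(
  IMP       = round(exp(runif(n, 7, 13))),   # impressions
  AVG_PRICE = runif(n, 0.5, 2.5),            # average winning price
  SD_PRICE  = runif(n, 0, 0.5)               # standard deviation of the price
)
# Made-up relationship: clicks grow with impressions; +1 keeps log() finite
x$CLK <- round(exp(0.6 * log(x$IMP) - 3 + rnorm(n, sd = 0.3))) + 1

fit <- rpart(log(CLK) ~ log(IMP) + AVG_PRICE + SD_PRICE, data = x)
printcp(fit)                          # complexity table for the fitted tree
rho <- cor(predict(fit), log(x$CLK))  # agreement of fitted and observed values
rho
```

On real data the tree structure and fit quality will differ; the point is only the shape of the rpart() call and how to inspect the result.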
24
CASE STUDY: SHARETHIS EXAMPLE
• Regression Tree
[Figure: fitted regression tree. Root split: log(IMP) < 9.33. Further splits on log(IMP), SD_PRICE, and AVG_PRICE. Leaf predictions of log(CLK) range from 0.751 to 4.753]
25
CASE STUDY: SHARETHIS EXAMPLE
• Predict Log of Clicks
[Figure: scatter plot of predict(fit) (vertical axis, 1 to 4) against log(x$CLK) (horizontal axis, 0 to 7)]
26
CASE STUDY: COLOR DETECTION
• Detect color from product image:
[Figure: three panels, each with both axes running from -1.0 to 1.0]
27
RESOURCES
• Books:
– An Introduction to Statistical Learning: with
Applications in R, G. James, D. Witten, T. Hastie,
R. Tibshirani, 2013
– The Art of R Programming: A Tour of Statistical
Software Design, N. Matloff, 2011
– R Cookbook (O'Reilly Cookbooks), P. Teetor, 2011
• R Blog:
– http://www.r-bloggers.com

