INTRODUCTION TO R AND RATTLE
IAUSHIRAZ 1/14/2017
What is R
Statistical programming language
Widely used by statisticians and data miners for developing statistical software and for data analysis.
Free and Open Source
Written in C, Fortran and R
Statistical features
Linear and nonlinear modeling
Statistical tests
Classification, Clustering
Can manipulate R Objects with C, C++, Java, .NET or Python code.
Source Example
> x <- c(1,2,3,4,5,6) # Create ordered collection (vector)
> y <- x^2 # Square the elements of x
> print(y) # print (vector) y
[1] 1 4 9 16 25 36
> mean(y) # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y) # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x) # Fit a linear regression model "y = f(x)" or "y = B0 + (B1 * x)"
# store the results as lm_1
> print(lm_1) # Print the model from the (linear model object) lm_1
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     -9.333        7.000

> summary(lm_1) # Compute and print statistics for the fit
# of the (linear model object) lm_1

Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5       6
 3.3333 -0.6667 -2.6667 -2.6667 -0.6667  3.3333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.3333     2.8441  -3.282 0.030453 *
x             7.0000     0.7303   9.585 0.000662 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared: 0.9583, Adjusted R-squared: 0.9478
F-statistic: 91.88 on 1 and 4 DF, p-value: 0.000662
> par(mfrow=c(2, 2)) # Request 2x2 plot layout
> plot(lm_1) # Diagnostic plot of regression model
Graphical front-ends
Architect – cross-platform open source IDE based on Eclipse and StatET
DataJoy – Online R Editor focused on beginners to data science and collaboration.
Deducer – GUI for menu-driven data analysis (similar to SPSS/JMP/Minitab).
Java GUI for R – cross-platform stand-alone R terminal and editor based on Java (also known as JGR).
Number Analytics – cloud-based GUI for R-based business analytics (similar to SPSS).
Rattle GUI – cross-platform GUI based on RGtk2 and specifically designed for data mining.
R Commander – cross-platform menu-driven GUI based on tcltk (several plug-ins to Rcmdr are also available).
Revolution R Productivity Environment (RPE) – Visual Studio-based IDE provided by Revolution Analytics, with plans for a web-based point-and-click interface.
RGUI – comes with the pre-compiled version of R for Microsoft Windows.
RKWard – extensible GUI and IDE for R.
RStudio – cross-platform open source IDE (which can also be run on a remote Linux server).
What is Rattle
Graphical user interface package for R
Developed by Graham Williams of Togaware Pty Ltd.
Free and Open Source
Presents statistical and visual summaries of data
Tabs :
Load Data
Data Exploration
Model
Evaluation
Test
…
Rattle Installation Process
Download and install R
https://r-project.org
About 60MB
Install the Rattle package
About 300MB
Follow the instructions:
  • install.packages("rattle", dependencies=c("Depends", "Suggests"))
  • library(rattle)
  • rattle()
Load Data
Dataset Types:
CSV File (CSV, TXT, Excel)
ARFF (a CSV-like file that adds type information)
ODBC (MySQL, SQLite, SQL Server, …)
  • Set connections in: /etc/odbcinst.ini & /etc/odbc.ini
R Dataset (datasets that already exist in the current R session)
R Data File
Library (pre-existing datasets from installed packages)
Corpus (collection of documents)
Script (scripts for generating datasets)
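A few of these sources can be approximated from the R console. The lines below are a minimal sketch in base R, not what Rattle runs internally; the file contents and the names csv_path, rdata_path and ds are made up for illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,income",
             "1,34,52000",
             "2,45,61000",
             "3,29,NA"), csv_path)

# CSV File: read a delimited text file into a data frame.
ds <- read.csv(csv_path, na.strings = c("", "NA"))

# R Data File: save objects to an .RData file, then load them back.
rdata_path <- tempfile(fileext = ".RData")
save(ds, file = rdata_path)
loaded_names <- load(rdata_path)  # returns the names of the restored objects

# Library: datasets that ship with installed packages.
data(iris, package = "datasets")

str(ds)
```

ODBC sources need a configured driver and an extra package, so they are left out of this sketch.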
Load Data
Variable Types:
Input (most variables are inputs)
  • Used to predict the target variable
Target (influenced by the input variables)
  • Also known as the output
  • Prefix: TARGET_
Risk (a measure of the size of the targets)
  • Prefix: RISK_
Identifier (any numeric variable that has a unique value for each observation; not normally used in modeling)
  • Such as: ID, Date
  • Prefix: ID_
Ignore (excluded from modeling)
  • Prefix: IGNORE_
Weight (observation weights, specified as an R formula)
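The prefix convention can be illustrated with a small helper. This is a sketch under assumptions: guess_role is a hypothetical function written for this example (it is not part of Rattle), and the column names are invented.

```r
# Hypothetical helper: guess a variable role from its name prefix.
guess_role <- function(name) {
  if (startsWith(name, "TARGET_")) {
    "target"
  } else if (startsWith(name, "RISK_")) {
    "risk"
  } else if (startsWith(name, "ID_")) {
    "ident"
  } else if (startsWith(name, "IGNORE_")) {
    "ignore"
  } else {
    "input"   # everything else defaults to an input variable
  }
}

cols  <- c("ID_customer", "age", "income",
           "IGNORE_notes", "RISK_amount", "TARGET_default")
roles <- vapply(cols, guess_role, character(1))
print(roles)

# Build a model formula from the input and target roles only.
input  <- names(roles)[roles == "input"]
target <- names(roles)[roles == "target"]
model_formula <- reformulate(input, response = target)
print(model_formula)
```

Identifier, risk and ignored columns stay out of the formula, which mirrors the idea that they are not normally used in modeling.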
Transform
Rescale
Normalize
  • Recenter
  • Scale [0-1]
  • Median/MAD
  • Natural Log / Log 10
  • Matrix
Order
  • Rank
  • Interval
  • Number of Groups
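Most of these rescaling options have short base R equivalents. The sketch below shows one plausible reading of each option on a made-up vector; Rattle's own transformations may differ in detail.

```r
x <- c(2, 4, 4, 10, 50)

# Recenter: subtract the mean and divide by the standard deviation.
recentered <- as.numeric(scale(x))

# Scale [0-1]: map the minimum to 0 and the maximum to 1.
scaled01 <- (x - min(x)) / (max(x) - min(x))

# Median/MAD: a robust variant using the median and
# the median absolute deviation.
robust <- (x - median(x)) / mad(x)

# Natural log (log10(x) gives the base-10 version).
logged <- log(x)

# Rank: replace each value by its rank; ties share the average rank.
ranked <- rank(x)

print(data.frame(x, recentered, scaled01, robust, logged, ranked))
```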
Transform
Impute (missing values)
Zero
Mean
Median
Mode
Constant
Recode
Quantiles
K-Means
Equal Width
Indicator variable / Join Categories
As Categorical / As Numeric
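The imputation and recoding options can be sketched in base R as follows. The vectors age and colour are invented for this example, and the code only approximates what the corresponding Rattle options do.

```r
age <- c(23, 35, NA, 41, 58, NA, 30, 47)

# Impute: replace missing values with zero or with the median.
age_zero   <- ifelse(is.na(age), 0, age)
age_median <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# Recode into bins: quantile-based bins and equal-width bins.
breaks_q     <- quantile(age_median, probs = seq(0, 1, 0.25))
age_quantile <- cut(age_median, breaks = breaks_q, include.lowest = TRUE)
age_equal    <- cut(age_median, breaks = 4)

# Indicator variables: one 0/1 column per category.
colour     <- factor(c("red", "green", "red", "blue"))
indicators <- model.matrix(~ colour - 1)

# As Categorical / As Numeric: convert between types.
age_factor  <- as.factor(age_median)
age_numeric <- as.numeric(as.character(age_factor))

print(table(age_quantile))
print(indicators)
```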
Transform
Cleanup
Delete Ignored
Delete Selected
Delete Missing
Delete Observations with Missing
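The cleanup operations map onto simple data frame subsetting. This sketch assumes that Delete Missing drops variables containing missing values and that Delete Observations with Missing drops rows; the data frame ds and the ignored vector are made up.

```r
ds <- data.frame(
  id     = 1:5,
  age    = c(23, NA, 41, 58, 30),
  notes  = c(NA, NA, NA, "x", NA),
  income = c(10, 20, 30, 40, 50)
)

# Delete Ignored: drop the variables marked as ignored.
ignored    <- c("id")
no_ignored <- ds[, setdiff(names(ds), ignored), drop = FALSE]

# Delete Missing: drop every variable that contains a missing value.
no_missing_vars <- ds[, colSums(is.na(ds)) == 0, drop = FALSE]

# Delete Observations with Missing: drop every row with a missing value.
complete_obs <- na.omit(ds)

print(names(no_ignored))
print(names(no_missing_vars))
print(complete_obs)
```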
Exploration
Summary
Summary
  • Min, max, mean and quartile values.
Describe
  • Missing, unique, sum, mean, lowest and highest values.
Basics (for numeric values)
  • Measures of numeric data (missing, min, max, quartiles, mean, sum, skewness, kurtosis)
Kurtosis (for numeric values)
  • A larger value indicates a sharper peak.
  • A smaller value indicates a flatter peak.
Skewness (for numeric values)
  • A positive skew indicates that the tail to the right is longer.
  • A negative skew indicates that the tail to the left is longer.
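Skewness and kurtosis can be computed by hand to see what the numbers mean. The functions below use the simple moment-based definitions (with excess kurtosis, so a normal distribution scores about 0); they are written for this example, and packages may use slightly different estimators.

```r
# Moment-based skewness: third central moment over variance^1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 20)   # one large value stretches the right tail

print(summary(x))           # min, quartiles, mean, max
print(skewness(x))          # positive: longer right tail
print(skewness(-x))         # negative: longer left tail
print(excess_kurtosis(x))   # large: sharp peak with a heavy tail
```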
Exploration
Summary
Show Missing
  • Each row corresponds to a pattern of missing values.
  • The patterns can help explain why the data is missing.
  • Rows and columns are sorted in ascending order of missing data.
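A missing-value pattern table can be approximated in base R. In this sketch each observation is encoded as a string with 1 for an observed value and 0 for a missing one; the data frame ds is made up, and the output layout is simpler than the table Rattle shows.

```r
ds <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 3, 4),
  c = c(1, 2, 3, 4)
)

# One pattern string per observation: "1" = observed, "0" = missing.
pattern <- apply(is.na(ds), 1, function(r) {
  paste(ifelse(r, "0", "1"), collapse = "")
})

# How many observations share each pattern.
pattern_counts <- table(pattern)
print(pattern_counts)

# How many values are missing in each variable, in ascending order.
missing_per_var <- colSums(is.na(ds))
print(sort(missing_per_var))
```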
Exploration
Distributions (review the distributions of each variable in dataset)
Annotate (include numeric values in plots)
Group by
Numeric Outputs:
  • Box Plot
  • Histogram
  • Cumulative
  • Benford (for any number of continuous variables)
  • Pairs
Categorical Outputs:
  • Bar Plot
  • Dot Plot
  • Mosaic
  • Pairs
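The Benford option compares the leading digits of a variable with Benford's law, under which digit d is expected with probability log10(1 + 1/d). The sketch below does that comparison by hand on simulated log-normal data; the data and variable names are assumptions made for illustration.

```r
set.seed(42)
x <- rlnorm(5000, meanlog = 5, sdlog = 2)   # simulated positive values

# Leading digit, taken from scientific notation such as "1.234567e+02".
first_digit <- as.integer(substr(formatC(x, format = "e", digits = 6), 1, 1))

# Observed share of each leading digit 1..9.
observed <- as.numeric(table(factor(first_digit, levels = 1:9))) / length(x)

# Expected share under Benford's law.
expected <- log10(1 + 1 / (1:9))

print(round(rbind(digit = 1:9, observed, expected), 3))
```

The other numeric plots correspond to the base graphics functions boxplot(), hist() and plot(ecdf(x)).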
Exploration
Correlations (Rattle only computes correlations between numeric variables at this time)
Ordered
  • Order by strength of correlations
Explore Missing
  • Correlation between missing values
Hierarchical
  • Pearson
  • Kendall
  • Spearman
Principal Components
SVD
  • For numeric variables only
Eigen
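The correlation and principal component options can be reproduced on the numeric columns of the built-in iris data. This is a sketch using base R functions; it assumes the SVD option corresponds to prcomp() and the Eigen option to princomp(), which is how those two functions compute their results.

```r
num <- iris[, sapply(iris, is.numeric)]   # correlations need numeric variables

# The three correlation measures.
cor_p <- cor(num, method = "pearson")
cor_k <- cor(num, method = "kendall")
cor_s <- cor(num, method = "spearman")

# Hierarchical: cluster the variables, treating 1 - correlation as a distance.
hc_vars <- hclust(as.dist(1 - cor_p))

# Principal components, computed two ways on standardized data.
pc_svd <- prcomp(num, scale. = TRUE)   # singular value decomposition
pc_eig <- princomp(num, cor = TRUE)    # eigen decomposition of the correlation matrix

print(round(cor_p, 2))
print(summary(pc_svd))
```

Both decompositions give the same component standard deviations here, so the choice mostly affects numerical behavior rather than the result.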
Model
Tree
Traditional
  • Trade-off between performance and simplicity of explanation
Conditional
Forest (many decision trees built on random subsets of the data and variables)
Number of Trees
Number of Variables
Impute (replace missing numeric values with the median)
Sample Size (for balancing classes)
Importance (variable importance)
Rules (the collection of rules from the random forest)
ROC (ROC curve)
Errors
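A traditional decision tree can be fitted from the console with the rpart package, which ships with standard R distributions. The sketch below uses the built-in iris data and an arbitrary 70/30 split; these choices are assumptions for illustration, not Rattle defaults.

```r
library(rpart)

set.seed(42)
train_idx <- sample(nrow(iris), 105)   # 70% of the 150 rows for training

# Fit a classification tree predicting Species from all other variables.
fit <- rpart(Species ~ ., data = iris[train_idx, ], method = "class")

# Evaluate on the held-out rows.
pred     <- predict(fit, iris[-train_idx, ], type = "class")
accuracy <- mean(pred == iris$Species[-train_idx])

print(fit)                       # the tree as a readable set of splits
print(fit$variable.importance)   # variable importance
print(accuracy)
```

A random forest follows the same pattern but needs an additional package such as randomForest, so it is not shown here.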
Model
SVM
Starts with two parallel vectors
Linear (linear regression)
For continuous values
All
Cluster
K-Means
The number of clusters K must be set first
Ewkm
K-Means with entropy weighting
Hierarchical
No need to set the number of clusters first
BiCluster
Finds suitable subsets of both the variables and the observations
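The difference between K-Means and hierarchical clustering shows up in when the number of clusters is chosen. The sketch below uses base R on the numeric iris columns; the choice of 3 clusters and Ward linkage is an assumption made for this example.

```r
set.seed(42)
num <- scale(iris[, 1:4])   # standardize so every variable has equal weight

# K-Means: the number of clusters is fixed before fitting.
km <- kmeans(num, centers = 3, nstart = 10)

# Hierarchical: build the full tree first, choose the number of clusters later.
hc        <- hclust(dist(num), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Compare the two assignments.
print(table(kmeans = km$cluster, hierarchical = hc_groups))
```

Calling cutree() again with a different k reuses the same tree, which is why the hierarchical method does not need the cluster count up front.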


Rattle Graphical Interface for R Language


Editor's Notes

  • #16: The intensity of the color is maximal for a perfect correlation, and minimal (white) if there is no correlation. Shades of red are used for negative correlations and blue for positive correlations.