R Text-Based Data I/O
R Data Frame Access and Manipulation
Ian M. Cook
September 29, 2010
R Data I/O, Access, and Manipulation September 29, 2010
Background Information
Data Types
R has several important data types:
  numeric (stores integers and floating point real numbers)
  character (stores strings of characters, not single characters)
  logical (stores TRUE or FALSE)
Data Containers
The most basic data storage container in R is a scalar, a 1x1 unit of data. A scalar might contain a unit of numeric, character, or logical data. (Strictly speaking, R has no separate scalar type: a scalar is simply a vector of length 1.)
A 1-dimensional array of scalars in R is a vector.
A 2-dimensional array of scalars in R can be a matrix or a data frame. (The focus here is on data frames. Matrices are often less useful and less accessible, so they are not covered in this presentation.)
R also has other data containers, including lists, which are important to know about but are often less useful for data analysis purposes.
Data Containers
A vector can be created in R using the function c(). To create several vectors of various lengths containing numerical, character, and logical data, we can enter
    v1 <- c(1, 3, 9, 3.14159, -88.1, 0)
    v2 <- c("abc", "def", "ghi")
    v3 <- c(TRUE, FALSE, TRUE, TRUE)
Data types cannot be mixed within a vector. If mixed data types are passed to c(), R coerces every entry to the most general type present. If any entry is a character string, all non-character entries are converted into character representations. If only logical and numeric values are mixed, TRUE and FALSE become 1 and 0.
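The coercion rule for c() can be checked directly. The following is a minimal sketch; the variable names mixed_chr and mixed_num are hypothetical and chosen only for illustration.

```r
# Mixing types in c() coerces every entry to the most general type present.
mixed_chr <- c(1, "abc", TRUE)     # a character entry is present: everything becomes character
mixed_num <- c(1.5, TRUE, FALSE)   # only logical and numeric: logicals become 1 and 0

print(class(mixed_chr))  # "character"
print(mixed_chr)         # "1" "abc" "TRUE"
print(class(mixed_num))  # "numeric"
print(mixed_num)         # 1.5 1.0 0.0
```

Calling class() on a vector is a quick way to confirm which type R settled on.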
Data Frames
A data frame is a rectangular array, with each column representing a variable.
Different columns in a data frame may have different data types. (E.g. a data frame might have character strings in column 1, numerical values in column 2, and logical values in column 3.)
A data frame can be created in R using the function data.frame(), but it is often more useful to input a data frame from an external data file or database.
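For completeness, here is a small sketch of building a data frame by hand with data.frame(). The column names and values are made up for illustration, and stringsAsFactors = FALSE is passed explicitly so that the character column stays character in every version of R.

```r
# A hand-built data frame with one character, one numeric, and one logical column.
# All names and values here are hypothetical.
ds_small <- data.frame(
  name  = c("abc", "def", "ghi"),
  value = c(1.5, 2.5, 3.5),
  flag  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

str(ds_small)   # shows the type of each column
```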
Data Frame Input/Output
Basic CSV Data Input
To read the contents of a CSV file into an R data frame named ds, use the command
    ds <- read.csv(file, header, …)
file is the name of the file, enclosed in single or double quotes.
header is TRUE by default, indicating that the first row of the CSV file contains the column names.
Example:
    ds <- read.csv("C:/data/file.csv", header=TRUE)
Important Tips
When specifying file paths, use forward slashes or double backslashes. (The single backslash is the escape character in R strings.)
    Works: ds <- read.csv("C:/data/file.csv")
    Works: ds <- read.csv("C:\\data\\file.csv")
    Fails: ds <- read.csv("C:\data\file.csv")
Other Delimited Text Files
To input a text data table delimited with characters other than commas, use the command
    ds <- read.table(file, header, sep, …)
sep specifies the delimiter:
    "," indicates a comma
    "\t" indicates the tab character
For example:
    ds <- read.table("C:/file.txt", sep="\t")
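The input commands above can be tried without an existing data file. This sketch writes two tiny files to temporary locations and reads them back; the file contents and the names path, csv_path, and ds_csv are hypothetical.

```r
# Create a small tab-delimited file in a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tscore", "1\tabc\t3.5", "2\tdef\t4.0"), path)

# read.table() needs header and sep spelled out for this file.
ds <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(ds)

# The same data as CSV: read.csv() calls read.table() with defaults suited
# to CSV files, including sep = "," and header = TRUE.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,abc,3.5", "2,def,4.0"), csv_path)
ds_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
print(ds_csv)
```

Both calls should produce the same two-row data frame with columns id, name, and score.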
Important Tips
The logical values TRUE and FALSE must be all caps.
If a data frame named ds already exists, the command ds <- read.csv(…) or any other command using ds on the left side of the assignment operator <- will overwrite ds if it executes successfully.
Refer to the R Documentation page on read.table(…) for more detailed information and for other options such as ignoring comment headers and using special quotation characters.
CSV Data Output
To write the contents of a data frame named ds to a CSV file, use the command
    write.csv(ds, file, …)
For example:
    write.csv(ds, "C:/data/file.csv")
To output a file delimited by a character other than the comma, use the command
    write.table(ds, file, … , sep)
Important Tips
The functions write.csv(…) and write.table(…) have many options, including col.names and row.names, which control whether column names and row names (by default, the row numbers) are written to the file.
Refer to the R Documentation on write.table(…) for more information.
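As an illustration of the output commands and the row.names option, the sketch below writes a small data frame to temporary files and reads the CSV back. The data and the names out_path, tsv_path, and ds_back are hypothetical.

```r
ds <- data.frame(id = 1:3, name = c("abc", "def", "ghi"),
                 stringsAsFactors = FALSE)

# Write a CSV file without the row-number column.
out_path <- tempfile(fileext = ".csv")
write.csv(ds, out_path, row.names = FALSE)
csv_lines <- readLines(out_path)
print(csv_lines)

# Write the same data tab-delimited and without quotation marks.
tsv_path <- tempfile(fileext = ".txt")
write.table(ds, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
tsv_lines <- readLines(tsv_path)
print(tsv_lines)

# Reading the CSV back should reproduce the original data frame.
ds_back <- read.csv(out_path, stringsAsFactors = FALSE)
```

With row.names left at its default of TRUE, the output file gains an extra leading column of row numbers, which then shows up as an extra column when the file is read back.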
Databases
R has simple facilities for querying databases and filling a data frame with the results of your query.
R can query MySQL databases using the R package RMySQL.
R can query Oracle databases using the R package ROracle.
Queries to either database type require the R package DBI.
MySQL Databases
To fill a data frame ds with the results of a SQL query against a MySQL database, use the following template R code:
    library(DBI)
    library(RMySQL)
    db_name <- "database_name"
    db_node <- "database_node"
    db_user <- "username"
    db_pw <- "password"
    mysql <- dbDriver("MySQL")
    sql_statement <- "select … from …"
    con <- dbConnect(mysql, user=db_user, password=db_pw, dbname=db_name, host=db_node)
    ds <- dbGetQuery(con, sql_statement)
    dbDisconnect(con)
Oracle Databases
To fill a data frame ds with the results of a SQL query against an Oracle database, use the following template R code:
    library(DBI)
    library(ROracle)
    db_name <- "database_name"
    db_user <- "username"
    db_pw <- "password"
    ora <- dbDriver("Oracle")
    sql_statement <- "select … from …"
    con <- dbConnect(ora, user=db_user, password=db_pw, dbname=db_name)
    ds <- dbGetQuery(con, sql_statement)
    dbDisconnect(con)
Data Frame Access and Manipulation
Accessing Columns in a Data Frame
Each column in a data frame represents a variable. Different columns may have different data types (e.g. character strings in column 1, numerical values in column 2, logical values in column 3).
Columns inside a data frame can be accessed by any of three basic methods:
  Dollar sign extraction operator $
  Square brackets extraction operator []
  subset() function
Dollar Sign Extraction Operator
A single column from a data frame can be accessed using the dollar sign operator $ as follows. To return a vector containing the data in the column named SIDD in the data frame named ds, issue the command
    ds$SIDD
Do not surround the name of the column in quotes when using the $ operator.
Square Brackets Extraction Operator
A single column from a data frame may also be accessed using the square brackets operator [] as follows. To return a vector containing the column named SIDD in the data frame named ds, issue the command
    ds[,"SIDD"]
You must surround the name of the column in double or single quotes when using the [] operator.
The comma before the column name is important, as you will see several slides ahead.
subset() Function
A third way to access a single column in a data frame utilizes R's subset() function. To return the column named SIDD in the data frame named ds, issue the command
    subset(ds, select="SIDD")
Unlike $ and [], this returns a one-column data frame rather than a vector. Adding the argument drop=TRUE returns the column as a vector instead.
Numerical Indices
R indexes data containers with integers, beginning at 1. This is unlike many programming languages, in which indices begin at 0.
The square brackets extraction operator also accepts the number of the column. If the third column in the data frame ds is named SIDD, then
    ds[,"SIDD"] and ds[,3]
are equivalent commands.
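The column access methods can be compared side by side. This is a minimal sketch on a made-up data frame whose third column is named SIDD; all values are hypothetical.

```r
ds <- data.frame(ID = 1:4,
                 NAME = c("a", "b", "c", "d"),
                 SIDD = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

by_dollar <- ds$SIDD          # vector, name unquoted
by_name   <- ds[, "SIDD"]     # vector, name quoted
by_number <- ds[, 3]          # vector, SIDD is the third column

# subset() keeps the data frame structure unless drop = TRUE is given.
by_subset     <- subset(ds, select = "SIDD")
by_subset_vec <- subset(ds, select = "SIDD", drop = TRUE)

print(by_dollar)
print(class(by_subset))       # "data.frame"
```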
Accessing Rows in a Data Frame
The rows of a data frame are not generally named, but are numbered beginning at 1.
The rows of a data frame can be accessed by either of two methods:
  Square brackets extraction operator []
  subset() function
Square Brackets Extraction Operator
To return the nth row of a data frame ds, issue the command
    ds[n,]
The result is a one-row data frame rather than a plain vector, because the columns may have different data types.
The comma after the row number is important. The square brackets expect a row number before the comma, and a column name or number after the comma.
Square Brackets Extraction Operator
Square brackets can also be used to return multiple rows of a data frame. To return a smaller data frame containing the nth through (n+m)th rows of a data frame ds, issue the command
    ds[n:(n+m),]
The above command also demonstrates the colon operator :, which is used to create sequences of integer numbers, in this case beginning with n and ending with n+m.
subset() Function
The subset() function is sometimes useful in returning multiple rows of a data frame. It is more complicated to use than the square brackets.
For example, to extract the 2nd, 4th, and 5th rows of a data frame with 5 rows, we could issue the commands:
    index <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
    subset(ds, subset=index)
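The row access methods can be sketched on a made-up five-row data frame. The last call, which passes a condition on a column to subset(), is not shown in the slides; it is included here as an assumption about how subset() is most commonly used.

```r
ds <- data.frame(ID = 1:5, SIDD = c(10, 20, 30, 40, 50))

row3     <- ds[3, ]      # one-row data frame
rows2to4 <- ds[2:4, ]    # rows 2 through 4, using the colon operator

# Logical index: TRUE marks the rows to keep.
index  <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
picked <- subset(ds, subset = index)

# subset() also accepts a condition written in terms of the column names.
by_cond <- subset(ds, SIDD > 25)

print(picked)
print(by_cond)
```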
Square Brackets Extraction Operator
An individual scalar entry within a data frame can be returned by using the square bracket operators, with numbers on both sides of the comma.
To return the scalar value in the mth row and nth column of a data frame ds, issue the command
    ds[m,n]
To return the scalar value in the mth row of the data frame ds, in the column named SIDD, issue the command
    ds[m,"SIDD"]
Assignment with [] and $
The square brackets and dollar sign can also be used to assign values within a data frame. If the column SIDD in the data frame ds contains numerical data, we can multiply the 5th entry in the SIDD column by two by issuing the command
    ds[5,"SIDD"] <- 2 * ds[5,"SIDD"]
We could create a new column (or replace the values within the column) named TWICE_SIDD in the data frame ds, and fill it with values twice those in the column SIDD, by issuing the command
    ds$TWICE_SIDD <- 2 * ds$SIDD
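Both assignments can be run in sequence on a small made-up data frame. Note that in this sketch the second assignment sees the already-doubled fifth entry.

```r
ds <- data.frame(SIDD = c(1, 2, 3, 4, 5))

# Double the 5th entry of the SIDD column in place.
ds[5, "SIDD"] <- 2 * ds[5, "SIDD"]

# Create a new column holding twice the (updated) SIDD values.
ds$TWICE_SIDD <- 2 * ds$SIDD

print(ds)
```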
Dimensions
Commands to return the dimensions of a data frame ds are
    dim(ds)
    nrow(ds)
    ncol(ds)
dim(ds) returns a vector of length two containing the number of rows in position 1 and the number of columns in position 2.
The command to return the length of a vector v is
    length(v)
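A short sketch of the dimension commands on a made-up data frame with 4 rows and 3 columns. The names d and v are hypothetical.

```r
ds <- data.frame(a = 1:4,
                 b = c("w", "x", "y", "z"),
                 c = c(TRUE, FALSE, TRUE, TRUE))

d <- dim(ds)
print(d)          # 4 3
print(nrow(ds))   # 4
print(ncol(ds))   # 3

v <- c(1, 3, 9)
print(length(v))  # 3
```

Calling length() on a data frame returns its number of columns, not its number of rows, so nrow() is the safer choice when counting observations.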
Factors
In versions of R before 4.0.0, character vector columns in data frames are stored as factors by default. Since R 4.0.0 they remain character vectors unless stringsAsFactors=TRUE is specified. In R, a factor is an indexed vector.
To factor a vector, R identifies the unique entries in the vector and makes them the levels of the factor. Each vector entry is then indexed by an integer to one of the factor levels. This saves memory when the entries in a vector are not all unique.
There are several functions to handle factors. Refer to the R Documentation or Help pages about factors.
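The level-and-index structure of a factor can be inspected directly. The following is a minimal sketch with made-up data; the names colors and f are hypothetical.

```r
colors <- c("red", "blue", "red", "green", "blue", "red")
f <- factor(colors)

print(levels(f))        # the unique entries, sorted: "blue" "green" "red"
print(as.integer(f))    # the index of each entry into the levels: 3 1 3 2 1 3
print(table(f))         # how often each level occurs

# Converting back recovers the original character vector.
back <- as.character(f)
```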
Connections and Line-by-Line Text Input/Output
Connections
In some cases, it is preferable to import or export data line-by-line.
Line-by-line data input/output reduces R's memory usage and is useful when dealing with very large delimited text datasets.
Line-by-line text input/output can be useful for reading and writing log files.
The first step in reading line-by-line is opening a file connection.
Connections for Input
R can open a text file connection conn for input using the command
    conn <- file(filename, open="rt")
If the specified file exists and is accessible, then a connection is created and opened for text reading.
Example:
    conn <- file("C:/data/in.txt", open="rt")
("rt" indicates "read text")
Line-by-Line Input
Once a text file input connection is open, we can use one of R's line-by-line text input functions:
    readLines(conn, n)
    scan(…)
The scan(…) function is useful for importing delimited data files (e.g. CSV) line by line. The scan(…) function has many arguments. Refer to its lengthy R Documentation page for details.
The readLines(…) function is simpler and is useful for reading unstructured lines of text.
Line-by-Line Input
To read one line of text from a file into the character variable str (a character vector of length 1), we could use the following series of commands
    conn <- file("C:/data/in.txt", open="rt")
    str <- readLines(conn, n=1)
    close(conn)
The close(conn) command closes the connection, leaving the file intact, and leaving str in the R workspace.
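Reading a whole file one line at a time extends the commands above with a loop. This is a sketch under the assumption that the file is a small temporary text file created just for the example; in_path and lines_read are hypothetical names.

```r
# Create a small text file to read from.
in_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), in_path)

conn <- file(in_path, open = "rt")
lines_read <- character(0)
repeat {
  str <- readLines(conn, n = 1)
  if (length(str) == 0) break    # readLines() returns character(0) at end of file
  lines_read <- c(lines_read, str)
}
close(conn)

print(lines_read)
```

Because the connection stays open between calls, each readLines(conn, n=1) picks up where the previous call stopped. In a real application the loop body would process each line rather than accumulate all of them in memory.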
Connections for Output
R can create a text file connection conn for output using the command
    conn <- file(filename, open="wt")
If the file does not exist, it is created. If the file already exists, its contents are erased!
Example:
    conn <- file("C:/data/out.txt", open="wt")
("wt" indicates "write text")
Output to a Connection
Once a text file output connection is open, we can write text to the connection by making one or more calls to the R function
    write("text to write", file=conn, append=TRUE)
Once finished writing text to the connection, close it using the command
    close(conn)
Output to a File
R can also write directly to a file without creating a connection. In this example, we retain the contents of an existing text file and append new text.
To write the contents of the character string str to a file, issue the command
    write(str, file=filename, append=TRUE)
Example:
    str <- "some text to output \nline 2"
    write(str, file="C:/data/out.txt", append=TRUE)
Output to a File
If the specified file does not exist, the write() command will create it.
Be sure to use the append=TRUE option when appending to an existing text file, or the file's contents will be cleared!
There is no need to use the close() command after writing to a file without using a connection, because no persistent connection has been opened.
Use the newline character \n to create line breaks in text output.
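The two output styles can be combined in one short sketch: first writing through a connection, then appending directly by file name. The temporary file and the names out_path and result are hypothetical.

```r
out_path <- tempfile(fileext = ".txt")

# Write two lines through an open connection.
conn <- file(out_path, open = "wt")
write("line 1", file = conn)
write("line 2", file = conn)
close(conn)

# Append two more lines directly by file name; \n creates the line break.
write("line 3\nline 4", file = out_path, append = TRUE)

result <- readLines(out_path)
print(result)
```

Successive writes to an open connection follow one another in the file. Leaving out append=TRUE in the final call would replace the first two lines instead of adding to them.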
Gzip Connections
R provides facilities for line-by-line reading and writing of files compressed by the gzip utility.
To create a connection to a gzip file for reading, issue the command
    conn <- gzfile(filename, open="rt")
To create a connection to a gzip file for writing, issue the command
    conn <- gzfile(filename, open="wt")
The readLines(), write(), and close() functions can be used in the same way as with text file connections.
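A round trip through a gzip connection looks the same as with a plain text file. This sketch uses a temporary file; gz_path, first, and rest are hypothetical names.

```r
gz_path <- tempfile(fileext = ".gz")

# Write two lines to a gzip-compressed file.
conn <- gzfile(gz_path, open = "wt")
write("compressed line 1", file = conn)
write("compressed line 2", file = conn)
close(conn)

# Read them back: one line first, then whatever remains.
conn <- gzfile(gz_path, open = "rt")
first <- readLines(conn, n = 1)
rest  <- readLines(conn)
close(conn)

print(first)
print(rest)
```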

  • 1. R Text-Based Data I/O R Data Frame Access and Manipulation Ian M. Cook September 29, 2010
  • 2. R Data I/O, Access, and Manipulation September 29, 2010 Background Information
  • 3. Data Types R has several important data types: numeric (stores integers and floating point real numbers) character (stores strings of characters, not single characters) logical (stores TRUE or FALSE) R Data I/O, Access, and Manipulation September 29, 2010
  • 4. Data Containers The most basic data storage container in R is a scalar , a 1x1 unit of data. A scalar might contain a unit of numeric, character, or logical data. A 1-dimensional array of scalars in R is a vector . A 2-dimensional array of scalars in R can be a matrix or a data frame . (The focus here is on data frames. Matrices are often less useful and less accessible so are not covered in this presentation.) R also has other data containers, including lists , which are important to know about but are often less useful for data analysis purposes. R Data I/O, Access, and Manipulation September 29, 2010
  • 5. Data Containers A vector can be created in R using the function c() . To create several vectors of various lengths containing numerical, character, and logical data, we can enter v1 <- c(1, 3, 9, 3.14159, -88.1, 0) v2 <- c(&quot;abc&quot;,&quot;def&quot;,&quot;ghi&quot;) v3 <- c(TRUE, FALSE, TRUE, TRUE) Data types cannot be mixed within a vector. Entering mixed data types into a vector using the c() function converts all non-character entries into character representations. R Data I/O, Access, and Manipulation September 29, 2010
  • 6. Data Frames A data frame is a rectangular array, with each column representing a variable. Different columns in a data frame may have different data types. (E.g. a data frame might have character strings in column 1, numerical values in column 2, and logical values in column 3.) A data frame can be created in R using the function data.frame(), but it is often more useful to input a data frame from an external data file or database. R Data I/O, Access, and Manipulation September 29, 2010
  • 7. R Data I/O, Access, and Manipulation September 29, 2010 Data Frame Input/Output
  • 8. Basic CSV Data Input To read the contents of a CSV file into an R data frame named ds , use the command ds <- read.csv(file, header, …) header is TRUE by default, indicating that the first row of the CSV file contains the row names. file is the name of the file, enclosed in single or double quotes. Example: ds <- read.csv(&quot;C:/data/file.csv&quot;, header=TRUE) R Data I/O, Access, and Manipulation September 29, 2010
  • 9. Important Tips When specifying file paths, use front slashes or double backslashes. (The single backslash is a special character in R.) Works: ds <- read.csv(&quot;C: / data / file.csv&quot;) Works: ds <- read.csv(&quot;C: \\ data \\ file.csv&quot;) Fails: ds <- read.csv(&quot;C: \ data \ file.csv&quot;) R Data I/O, Access, and Manipulation September 29, 2010
  • 10. Other Delimited Text Files To input a text data table delimited with characters other than commas, use the command ds <- read.table(file, header, sep, …) sep specifies the delimiter: &quot;,&quot; indicates a comma &quot;\t&quot; indicates the tab character For example: ds <- read.table(&quot;C:/file.txt&quot;, sep=&quot;\t&quot;) R Data I/O, Access, and Manipulation September 29, 2010
  • 11. Important Tips The logical values TRUE and FALSE must be all caps. If a data frame with named ds already exists, the command ds <- read.csv(…) or any other command using ds on the left side of the assignment operator <- will overwrite ds if it executes successfully. Refer to the R Documentation page on read.table(…) for more detailed information and for other options such as ignoring comment headers and using special quotation characters. R Data I/O, Access, and Manipulation September 29, 2010
  • 12. CSV Data Output To write the contents of a data frame named ds to a CSV file, use the command write.csv(ds, file, …) For example: write.csv(ds, &quot;C:/data/file.csv&quot;) To output a file delimited by a character other than the comma, use the command write.table(ds, file, … , sep) R Data I/O, Access, and Manipulation September 29, 2010
  • 13. Important Tips The functions write.csv(…) and write.table(…) have many options, including col.names and row.names , which allow users to choose whether to use column naming and/or row numbering. Refer to the R Documentation on write.table(…) for more information. R Data I/O, Access, and Manipulation September 29, 2010
  • 14. Databases R has simple facilities for querying databases and filling a data frame with the results of your query. R can query MySQL databases using the R package RMySQL . R can query Oracle databases using the R package ROracle . Queries to either database type require the R package DBI . R Data I/O, Access, and Manipulation September 29, 2010
  • 15. MySQL Databases To fill a data frame ds with the results of a SQL query against a MySQL database, use the following template R code: library(DBI) library(RMySQL) db_name <- &quot;database_name&quot; db_node <- &quot;database_node&quot; db_user <- &quot;username&quot; db_pw <- &quot;password&quot; mysql <- dbDriver(&quot;MySQL&quot;) sql_statement <- &quot;select … from …&quot; con <- dbConnect(mysql, user=db_user, password=db_pw, dbname=db_name, host=db_node) ds <- dbGetQuery(con, sql_statement) mysqlCloseConnection(con) R Data I/O, Access, and Manipulation September 29, 2010
  • 16. Oracle Databases To fill a data frame ds with the results of a SQL query against an Oracle database, use the following template R code: library(DBI) library(ROracle) db_name <- &quot;database_name&quot; db_user <- &quot;username&quot; db_pw <- &quot;password&quot; ora <- dbDriver(&quot;Oracle&quot;) sql_statement <- &quot;select … from …&quot; con <- dbConnect(ora, user=db_user, password=db_pw, dbname=db_name) ds <- dbGetQuery(con, sql_statement) dbDisconnect(con) R Data I/O, Access, and Manipulation September 29, 2010
  • 17. R Data I/O, Access, and Manipulation September 29, 2010 Data Frame Access and Manipulation
  • 18. Accessing Columns in a Data Frame Each column in a data frame represents a variable. Different columns may have different data types (e.g. character strings in column 1, numerical values in column 2, logical values in column 3). Columns inside a data frame can be accessed in any of three basic methods: Dollar sign extraction operator $ Square brackets extraction operator [] subset() function R Data I/O, Access, and Manipulation September 29, 2010
  • 19. Dollar Sign Extraction Operator A single column from a data frame can be accessed using the dollar sign operator $ as follows. To return a vector containing the data in the column named SIDD in the data frame named ds , issue the command ds$SIDD Do not surround the name of the column in quotes when using the $ operator. R Data I/O, Access, and Manipulation September 29, 2010
  • 20. Square Brackets Extraction Operator A single column from a data frame may also be accessed using the square brackets operator [] as follows. To return a vector containing the column named SIDD in the data frame named ds , issue the command ds[,&quot;SIDD&quot;] You must surround the name of the column in double or single quotes when using the [] operator. The comma before the column name is important, as you will see several slides ahead. R Data I/O, Access, and Manipulation September 29, 2010
  • 21. subset() Function A third way to access a single column in a data frame utilizes R’s subset() function. To return a vector containing the column named SIDD in the data frame named ds , issue the command subset(ds, select=&quot;SIDD&quot;) R Data I/O, Access, and Manipulation September 29, 2010
  • 22. Numerical Indices R indexes data containers with integers, beginning at 1 . This is unlike most programming languages, in which indices begin at 0. The square brackets extraction operator also accepts the number of the column. If the third column in the data frame ds is named SIDD , then ds[,&quot;SIDD&quot;] and ds[,3] are equivalent commands. R Data I/O, Access, and Manipulation September 29, 2010
  • 23. Accessing Rows in a Data Frame The rows of a data frame are not generally named, but are numbered beginning at 1. The rows of a data frame can be accessed by either of two methods: Square brackets extraction operator [] subset() function R Data I/O, Access, and Manipulation September 29, 2010
  • 24. Square Brackets Extraction Operator To return a vector containing the n th row of a data frame ds , issue the command ds[n,] The comma after the column name is important. The square brackets expect a row number before the comma, and a column name or number after the comma. R Data I/O, Access, and Manipulation September 29, 2010
  • 25. Square Brackets Extraction Operator Square brackets can also be used to return multiple rows of a data frame. To return a smaller data frame containing the n th through n+m th rows of a data frame ds , issue the command ds[n:(n+m),] The above command also demonstrates the colon operator : , which is used to create sequences of integer numbers, in this case beginning with n and ending with n+m . R Data I/O, Access, and Manipulation September 29, 2010
  • 26. subset() Function The subset() function is sometimes useful in returning multiple rows of a data frame. It is more complicated to use than the square brackets. For example, to extract the 2nd, 4th, and 5th rows of a data frame with 5 rows, we could issue the commands: index <- c(FALSE, TRUE, FALSE, TRUE, TRUE); subset(ds, subset=index)
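The row-access commands from the preceding slides can be tried together. The sketch below uses a made-up five-row data frame (the ID column and all values are assumptions) and shows a single row, a range of rows, and the logical-index form of subset().

```r
# Hypothetical five-row example data.
ds <- data.frame(ID   = c("a", "b", "c", "d", "e"),
                 SIDD = c(10, 20, 30, 40, 50))

one_row   <- ds[2, ]      # one-row data frame
row_range <- ds[2:4, ]    # rows 2 through 4, built with the colon operator

# Logical index: TRUE marks the rows to keep (2nd, 4th, and 5th).
index  <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
picked <- subset(ds, subset = index)

# The same rows selected by number with square brackets.
picked_brackets <- ds[c(2, 4, 5), ]

print(picked)
```

The logical vector must have one entry per row; the square-bracket form with row numbers is usually shorter when the wanted rows are known in advance.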
  • 27. Square Brackets Extraction Operator An individual scalar entry within a data frame can be returned by using the square bracket operators, with numbers on both sides of the comma. To return the scalar value in the mth row and nth column of a data frame ds, issue the command ds[m,n] To return the scalar value in the mth row of the data frame ds, in the column named SIDD, issue the command ds[m,"SIDD"]
  • 28. Assignment with [] and $ The square brackets and dollar sign can also be used to assign values within a data frame. If the column SIDD in the data frame ds contains numerical data, we can multiply the 5th entry in the SIDD column by two by issuing the command ds[5,"SIDD"] <- 2 * ds[5,"SIDD"] We could create a new column (or replace the values within the column) named TWICE_SIDD in the data frame ds, and fill it with values twice those in the column SIDD, by issuing the command ds$TWICE_SIDD <- 2 * ds$SIDD
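The scalar access and assignment commands above can be run end to end on a small example. This is a sketch with made-up values; only the names ds, SIDD, and TWICE_SIDD come from the slides.

```r
# Hypothetical example data.
ds <- data.frame(ID   = c("a", "b", "c", "d", "e"),
                 SIDD = c(10, 20, 30, 40, 50))

before <- ds[5, "SIDD"]               # scalar in row 5, column SIDD (50 here)
ds[5, "SIDD"] <- 2 * ds[5, "SIDD"]    # overwrite that one entry with its double

# Create a new column (or replace it, if it already exists).
ds$TWICE_SIDD <- 2 * ds$SIDD

print(ds)
```

Because the single-entry assignment runs first, the new TWICE_SIDD column is computed from the already-doubled fifth value.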
  • 29. Dimensions Commands to return the dimensions of a data frame ds are dim(ds), nrow(ds), and ncol(ds). dim(ds) returns a vector of length two containing the number of rows in position 1 and the number of columns in position 2. The command to return the length of a vector v is length(v).
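A short sketch of the dimension commands, using a made-up data frame and vector. One detail not stated on the slide: calling length() on a data frame returns its number of columns, not its number of rows.

```r
# Hypothetical example data.
ds <- data.frame(ID   = c("a", "b", "c", "d", "e"),
                 SIDD = c(10, 20, 30, 40, 50))
v  <- c(1, 3, 9)

d      <- dim(ds)     # rows in position 1, columns in position 2
n_rows <- nrow(ds)
n_cols <- ncol(ds)
len_v  <- length(v)   # number of entries in the vector
len_ds <- length(ds)  # for a data frame, this is the number of columns

print(d)
```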
  • 30. Factors In versions of R before 4.0.0, character columns in data frames are stored as factors by default; from R 4.0.0 onward they stay character vectors unless stringsAsFactors=TRUE is specified. In R, a factor is an indexed vector. To factor a vector, R identifies the unique entries in the vector and makes them the levels of the factor. Each vector entry is then stored as an integer index into the factor levels. This can save memory when the entries in a vector are not all unique. There are several functions to handle factors. Refer to the R Documentation or Help pages about factors.
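The level-and-index structure of a factor is easy to inspect directly. The sketch below builds a factor from a made-up character vector with factor(), then looks at its levels and the integer codes behind each entry.

```r
# Hypothetical character data with repeated entries.
colors <- c("red", "blue", "red", "green", "blue", "red")

f <- factor(colors)      # convert the character vector to a factor

lev   <- levels(f)       # the unique entries, sorted: "blue" "green" "red"
codes <- as.integer(f)   # index of each entry's level: 3 1 3 2 1 3
back  <- as.character(f) # recover the original strings

print(table(f))          # counts per level
```

Using as.character() is the usual way to get the original strings back; as.integer() returns the level indices, not the data.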
  • 31. Connections and Line-by-Line Text Input/Output
  • 32. Connections In some cases, it is preferable to import or export data line-by-line. Line-by-line data input/output reduces R's memory usage and is useful when dealing with very large delimited text datasets. Line-by-line text input/output can be useful for reading and writing log files. The first step in reading line-by-line is opening a file connection.
  • 33. Connections for Input R can open a text file connection conn for input using the command conn <- file(filename, open="rt") If the specified file exists and is accessible, then a connection is created and opened for text reading. Example: conn <- file("C:/data/in.txt", open="rt") ("rt" indicates "read text")
  • 34. Line-by-Line Input Once a text file input connection is open, we can use one of R's line-by-line text input functions: readLines(conn, n) or scan(…). The scan(…) function is useful for importing delimited data files (e.g. CSV) line by line. The scan(…) function has many arguments. Refer to its lengthy R Documentation page for details. The readLines(…) function is simpler and is useful for reading unstructured lines of text.
  • 35. Line-by-Line Input To read one line of text from a file into the length-one character vector str, we could use the following series of commands: conn <- file("C:/data/in.txt", open="rt"); str <- readLines(conn, n=1); close(conn) The close(conn) command closes the connection, leaving the file intact, and leaving str in the R workspace.
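Reading a whole file one line at a time follows the same pattern, repeated until the file is exhausted. The sketch below is one way to do it, assuming a temporary file created with tempfile() and writeLines() stands in for a real input file; the loop relies on readLines() returning a zero-length vector once no lines remain.

```r
# Hypothetical input: a temporary file stands in for a real data file.
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

conn <- file(path, open = "rt")        # open for text reading
n_lines <- 0
repeat {
  line <- readLines(conn, n = 1)       # read the next line only
  if (length(line) == 0) break         # zero-length result means end of file
  n_lines <- n_lines + 1
  cat(n_lines, ":", line, "\n")        # process the line (here, just print it)
}
close(conn)                            # always close the connection when done
```

Only one line is held in memory at a time, which is the point of the line-by-line approach for very large files.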
  • 36. Connections for Output R can create a text file connection conn for output using the command conn <- file(filename, open="wt") If the file does not exist, it is created. If the file already exists, its contents are erased! Example: conn <- file("C:/data/out.txt", open="wt") ("wt" indicates "write text")
  • 37. Output to a Connection Once a text file output connection is open, we can write text to the connection by making one or more calls to the R function write("text to write", file=conn, append=TRUE) Once finished writing text to the connection, close it using the command close(conn)
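The open, write, close sequence for output can be sketched as follows. A temporary file from tempfile() is assumed here in place of a real output path, and the file is read back with readLines() only to confirm what was written.

```r
# Hypothetical output location: a temporary file instead of a real path.
path <- tempfile(fileext = ".txt")

conn <- file(path, open = "wt")      # creates the file (or erases an existing one)
write("line one", file = conn)       # each call writes one line to the open connection
write("line two", file = conn)
close(conn)                          # flush and release the file

written <- readLines(path)           # read back to check the result
print(written)
```

Successive write() calls on an open connection continue from where the previous call stopped, so the lines accumulate in order until close() is called.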
  • 38. Output to a File R can also write directly to a file without creating a connection. In this example, we retain the contents of an existing text file and append new text. To write the contents of the character string str to a file, issue the command write(str, file=filename, append=TRUE) Example: str <- "some text to output \nline 2"; write(str, file="C:/data/out.txt", append=TRUE)
  • 39. Output to a File If the specified file does not exist, the write() command will create it. Be sure to use the append=TRUE option when appending to an existing text file, or the file's contents will be cleared! There is no need to use the close() command after writing to a file without using a connection, because no persistent connection has been opened. Use the newline character \n to create line breaks in text output.
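The difference between appending and overwriting can be seen with a few direct write() calls. This sketch assumes a temporary file from tempfile() in place of a real path and reads the file back after each step.

```r
# Hypothetical output location: a temporary file instead of a real path.
path <- tempfile(fileext = ".txt")

write("original contents", file = path)              # file does not exist yet: created
write("appended\nwith a line break", file = path,
      append = TRUE)                                  # kept, and \n starts a new line
after_append <- readLines(path)

write("replacement", file = path)                     # append defaults to FALSE
after_overwrite <- readLines(path)                    # earlier contents are gone

print(after_append)
print(after_overwrite)
```

No close() call appears because write() with a file name opens and closes the file itself on each call.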
  • 40. Gzip Connections R provides facilities for line-by-line reading and writing of files compressed by the gzip utility. To create a connection to a gzip file for reading, issue the command conn <- gzfile(filename, open="rt") To create a connection to a gzip file for writing, issue the command conn <- gzfile(filename, open="wt") The readLines(), write(), and close() functions can be used in the same way as with text file connections.
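A round trip through a gzip connection looks the same as with a plain text file. The sketch below assumes a temporary .gz file from tempfile(); it writes two lines through a gzfile() connection and then reads them back through another.

```r
# Hypothetical compressed file: a temporary path stands in for a real one.
path <- tempfile(fileext = ".gz")

conn <- gzfile(path, open = "wt")          # open for compressed text writing
write(c("alpha", "beta"), file = conn)     # one line per vector entry
close(conn)

conn <- gzfile(path, open = "rt")          # reopen for text reading
lines_back <- readLines(conn)              # decompression is handled by the connection
close(conn)

print(lines_back)
```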