Chapter 2 Working with files
So far we have seen how to enter data and create objects manually, but it is also possible, and in fact more common, to read data from files and store it in an object. If the target data file is properly structured, we will create a matrix or a ‘data.frame’ object which we can manipulate afterwards.
2.1 The working directory
Using data files normally requires us to specify the location of each file using its path. To avoid this, R has a tool that allows us to specify a target folder, the so-called working directory, to work with.
The working directory is the default path for reading and writing files of any kind. We can find the path to the current working directory using the getwd() command. To set a new working directory, we use the command setwd("path"). Remember that setwd() requires a string argument (whereas getwd() does not) to specify the path to the working directory ("path").
getwd()
setwd('C:/Users/Marcos/Desktop/')
getwd()
By default, in Windows the working directory is set to the Documents folder.
Once the working directory is specified, everything we do in R (read files, export tables and/or graphics …) will be done in that directory. However, it is possible to work with other file system locations, specifying a different path through the arguments of some functions.
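As a minimal sketch of the workflow just described, the lines below change the working directory and then restore it. The folder returned by tempdir() is used only so the example runs on any machine, and old_wd, other_file and the file name some_file.txt are hypothetical names chosen for illustration.

```r
# Remember the current working directory so we can come back to it.
old_wd <- getwd()

# Move to another folder. tempdir() is only a stand-in here;
# replace it with the path to your own folder.
setwd(tempdir())
getwd()

# A file outside the working directory can still be reached by giving
# its full path to the function that reads or writes it.
other_file <- file.path(old_wd, "some_file.txt")  # hypothetical file name
other_file

# Restore the original working directory.
setwd(old_wd)
```

Building paths with file.path() avoids typing the separators by hand, which is handy when the same script is used on Windows and on other systems.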
2.2 Reading data files
R allows you to read any type of file in ASCII format (text files). The most frequently used functions are:
- read.table() and its different variations
- scan()
- read.fwf()
For this course we will focus on the read.table() function, as it is quite versatile and easy to use. Before starting to use a new function we should always take a look at the available documentation. I will take this opportunity to show you how to do this in R so that you begin to become familiar with the R help.
Any function available in R, whether it is a standard one or belongs to an imported package, has a documentation entry in the R help. The help describes in detail how to use the function: its arguments, their types and defaults, a reference to the method (when applicable) and even short code examples. To access the help we use the help() function as follows:
help(read.table)
Calling the help() function opens the manual page. In this case we can see a brief description of the function read.table() and its different variants (read.csv(), …). Below it is the description of the arguments of the function, followed by some examples of application. This is the usual layout for all functions available in R.
In case we use the regular R interface or the cmd terminal, the help entry will open in our default web browser.
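Besides opening the full help page, it can be useful to check the arguments and defaults of a function directly in the console. The following sketch uses args() and formals(), two base R functions that are not mentioned above, to compare the defaults of read.table() and read.csv().

```r
# ?read.table is a shortcut for help(read.table).

# Print the argument names and their default values in the console.
args(read.table)

# Look up single defaults: read.table() expects no header and splits
# on white space, while read.csv() expects a header and commas.
formals(read.table)$header
formals(read.table)$sep
formals(read.csv)$header
formals(read.csv)$sep
```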
At first glance the read.table() function seems to require many arguments, while its variations seem simpler. As already mentioned, arguments are just parameters that we can specify or change when executing a function, thus tuning its operation and the result obtained.
Arguments are specified within the function call, separated by commas (“,”). However, it is not necessary to assign a value to each one of them: an omitted argument takes its default value (always reported in the help entry). The main advantage of read.table() over other functions like read.csv() is that we set every argument ourselves, whereas the latter comes with most of them preset. read.csv() is designed to open comma-separated files following the North American standard (, as field delimiter and . as decimal separator). On the other hand, read.table() can potentially open any text file regardless of the separator, encoding, decimal format and so on. Most calls to read.table() look like this:
read.table(file, header = TRUE, sep = ";", dec = ",")
- file: path and name of the file to open.
- header: TRUE/FALSE argument to determine whether the first row of the data file contains column names.
- sep: field or column separator.
- dec: decimal separator.
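To see what the sep and dec arguments change, here is a small self-contained sketch. It writes a tiny file with made-up values to a temporary location (tmp, wrong and right are hypothetical names) and reads it twice, first with the defaults and then with the separators that match the file.

```r
# Write a tiny semicolon-separated file with decimal commas to a
# temporary location (made-up values, for illustration only).
tmp <- tempfile(fileext = ".txt")
writeLines(c("id;height", "1;1,5", "2;2,25"), tmp)

# With the default sep = "" fields are split on white space only,
# so every line ends up in a single column.
wrong <- read.table(tmp, header = TRUE)
ncol(wrong)

# Telling read.table() about the separators gives two columns,
# and height is read as a number.
right <- read.table(tmp, header = TRUE, sep = ";", dec = ",")
str(right)
```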
Let’s see an example of reading a file. We will read the file coordinates.txt, located in the Data directory. The file is structured in 3 columns with a header, separated by “;”. All data are integers, so no decimal separator is needed. See Table 2.2.
The process to follow is:
- Set working directory.
- Use the function to save the data in an object.
If we do…
setwd('C:/Users/rmarc/Desktop/Copia_pen/European_Forestry_Master/module-1/')
read.table('Data/coordinates.txt',header=TRUE,sep=';')
…as we have not specified any object in which to store the result of the function, the contents of the file are simply printed on the terminal. This is a common mistake, don’t worry. To store and later access the contents of the file we will do the following:
setwd('C:/Users/rmarc/Desktop/Copia_pen/European_Forestry_Master/module-1/Data')
table<-read.table('coordinates.txt',header=TRUE,sep=';')
table
Please note that:
- You have to specify your own working directory.
- The path to the directory is specified in text format, so you type “in quotation marks”.
- The name of the file to be read is also specified as text.
- The header argument only accepts TRUE or FALSE values.
- The sep argument also requires a text value for the separator.
- It is advisable to save the data in an object (table <-).
The result of read.table() is a data.frame stored in an object named table. Since a data.frame is indexed with [row, column] just like an array, we can manipulate our table using the same procedure described in the Arrays section.
table[1,]
## FID_1 X_INDEX Y_INDEX
## 1 364011 82500 4653500
table[,1]
## [1] 364011 371655 487720 474504 436415 457549 469377 397162
## [9] 434666 478383 488973 394153 426962 362216
table[1,1]
## [1] 364011
Below are some useful functions to preview and verify the information, and to learn the structure of the data we have just imported. They are normally used to take a quick look at the first and last rows of an array or data.frame object and to describe the structure of a given object.
- head(): displays the first rows of the table.
- tail(): displays the last rows of the table.
- str(): displays the structure and the data type of each column (for example, factor or number).
head(table)
## FID_1 X_INDEX Y_INDEX
## 1 364011 82500 4653500
## 2 371655 110500 4661500
## 3 487720 55500 4805500
## 4 474504 28500 4783500
## 5 436415 85500 4729500
## 6 457549 38500 4757500
tail(table)
## FID_1 X_INDEX Y_INDEX
## 9 434666 41500 4727500
## 10 478383 49500 4789500
## 11 488973 329500 4807500
## 12 394153 77500 4684500
## 13 426962 36500 4718500
## 14 362216 148500 4651500
str(table)
## 'data.frame': 14 obs. of 3 variables:
## $ FID_1 : int 364011 371655 487720 474504 436415 457549 469377 397162 434666 478383 ...
## $ X_INDEX: int 82500 110500 55500 28500 85500 38500 39500 124500 41500 49500 ...
## $ Y_INDEX: int 4653500 4661500 4805500 4783500 4729500 4757500 4775500 4687500 4727500 4789500 ...
As already mentioned, read.table() is a good starting point for reading files and bringing data into our R working session. In any case we must be aware that there are other possibilities, such as read.csv() and read.csv2(), which we have already seen in the read.table() help entry. These functions are variations that preset some arguments, such as the field (column) separator or the decimal character. The help of the function gives the details.
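As an illustration of how the variants preset arguments, the sketch below reads the same small file (made-up values, hypothetical object names) with read.table() and with read.csv2(), which assumes a header, ; as field separator and , as decimal character.

```r
# A tiny semicolon-separated file with decimal commas (made-up values).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;height", "1;1,5", "2;2,25"), tmp)

# Spelling out every argument with read.table() ...
a <- read.table(tmp, header = TRUE, sep = ";", dec = ",")

# ... or relying on the presets of read.csv2().
b <- read.csv2(tmp)

# Compare both results.
all.equal(a, b)
```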
2.3 Writing text files
Of course, we can also write text files from our data. The procedure is quite similar to reading data, but using write.table() instead of read.table(). Remember that the created files are saved in the working directory, unless you specify an alternative path in the function arguments. As always, the first thing to do is consult the help of the function.
help("write.table")
The arguments of the function are similar to those already seen in read.table():
write.table(object, file, row.names, sep)
- object: object of type matrix (or data.frame) to write.
- file: name and path of the created file (in text format).
- row.names: whether to write the row names (TRUE or FALSE). FALSE is recommended.
- sep: column separator (in text format).
- dec: decimal separator (in text format).
Try the following instructions and observe the different results:
write.table(table,'table1.txt',row.names=TRUE,sep='\t')
write.table(table,'table2.txt',row.names=FALSE,sep='\t')
write.table(table,'table3.txt',row.names=FALSE,sep=';')
write.table(table,'C:/Users/Marcos/Desktop/table4.txt',sep=';')
write.csv(table,'C:/Users/Marcos/Desktop/tabla5.csv')
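A quick way to check what was written is to read the file back. The following sketch assumes a small data.frame named coords (a hypothetical name) built from the first two rows of the coordinates table, writes it to a temporary folder and reads it again.

```r
# First two rows of the coordinates table, typed in by hand.
coords <- data.frame(FID_1   = c(364011L, 371655L),
                     X_INDEX = c(82500L, 110500L),
                     Y_INDEX = c(4653500L, 4661500L))

# Write to a temporary folder so the sketch runs on any machine.
out <- file.path(tempdir(), "coords_copy.txt")
write.table(coords, out, row.names = FALSE, sep = ";")

# Inspect the raw text that was written ...
readLines(out)

# ... and read it back using the same separator.
back <- read.table(out, header = TRUE, sep = ";")
all.equal(coords, back)
```

Using the same sep when writing and reading is what makes the round trip work; with row.names = TRUE the file would carry an extra column of row labels.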
2.4 Data manipulation
Let’s see the most common instructions for manipulating and extracting information in R. Specifically, we will see how to extract subsets of data from objects of type vector, array or data.frame. We will also see how to create new data sets by combining several objects. There are many commands that allow us to manipulate our data in R, and many things can be understood as manipulation, but for the moment we will focus on:
- Select or extract information
- Sort tables
- Add rows or columns to a table
As you might already guess, we will work with tabular data like arrays and data.frames, which I will refer to as tables from now on.
2.4.1 Working with columns
The first thing we will do is access the information stored in the column(s) of a given table object. There are two basic ways to do this:
- Using the position index of the column.
- Using the name (header) of the column.
These two basic forms are not always interchangeable, so we will use one or the other depending on the case. It is recommended that you use the one that feels most comfortable for you. However, in most examples the column position index is used since it is a numerical value that is very easily integrated with loops and other iterative processes.
2.4.1.1 Extracting columns
To extract columns using the position index we use instructions similar to those already seen for extracting information from Arrays, Vectors and Lists. The following statement returns the second column of the data.frame object table and stores it in a new object that we call col2:
col2 <- table[,2]
col2
## [1] 82500 110500 55500 28500 85500 38500 39500 124500
## [9] 41500 49500 329500 77500 36500 148500
It is also possible to extract a range of columns, proceeding in a similar way to what has already been seen in the creation of vectors. The following statement extracts columns 2 and 3 from the table object and stores them in a new object called cols:
cols <- table[,2:3]
cols
## X_INDEX Y_INDEX
## 1 82500 4653500
## 2 110500 4661500
## 3 55500 4805500
## 4 28500 4783500
## 5 85500 4729500
## 6 38500 4757500
## 7 39500 4775500
## 8 124500 4687500
## 9 41500 4727500
## 10 49500 4789500
## 11 329500 4807500
## 12 77500 4684500
## 13 36500 4718500
## 14 148500 4651500
Now let’s see how to select columns using their name. Extraction by name combines the object name and the column name, using $ to separate the two (object$column). The following statement selects the column named Y_INDEX from the data.frame object table and stores it in col.Y_INDEX:
col.Y_INDEX <- table$Y_INDEX
col.Y_INDEX
## [1] 4653500 4661500 4805500 4783500 4729500 4757500 4775500
## [8] 4687500 4727500 4789500 4807500 4684500 4718500 4651500
A key piece of information here is the name of the column, which we need to know in advance. We can check the original text file or inspect the object table using str(). We can also look at the top-right window, activate the Environment sub-window and unfold table, but be aware that this is only available in RStudio.
The main difference between these two methods is that index selection makes it easy to extract column ranges. To select several columns by name you can use functions like subset():
cols2 <- subset(table, select = c(X_INDEX,Y_INDEX))
Using the argument select we can indicate the columns that we want to extract with a vector of column names. With subset() it is also possible to specify the columns that we do NOT want to extract. To do this proceed as follows:
cols2 <- subset(table, select = -c(X_INDEX,Y_INDEX))
In this way we extract only the first column, excluding X_INDEX and Y_INDEX. Of course, we can do this using the column index as well:
cols2 <- table[,-(2:3)]
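The square brackets also accept column names, which gives a third way of selecting columns. The sketch below assumes a small data.frame named coords (a hypothetical name) holding the first three rows of the coordinates table, and also shows the drop argument of the bracket operator, which controls whether a single column comes back as a vector or as a data.frame.

```r
# First three rows of the coordinates table, typed in by hand.
coords <- data.frame(FID_1   = c(364011L, 371655L, 487720L),
                     X_INDEX = c(82500L, 110500L, 55500L),
                     Y_INDEX = c(4653500L, 4661500L, 4805500L))

# Several columns selected by name inside the brackets.
by_name <- coords[, c("X_INDEX", "Y_INDEX")]
names(by_name)

# A single column is simplified to a vector by default ...
one_vec <- coords[, "FID_1"]
is.data.frame(one_vec)

# ... unless we ask R to keep the table structure.
one_df <- coords[, "FID_1", drop = FALSE]
is.data.frame(one_df)
```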
2.4.1.2 Merge columns and tables
The main reason why we are learning how to manipulate table columns is to be able to prepare our data for other purposes. We may need to join tables, or columns that come from the same original table. The cbind() instruction allows us to merge several tables and/or vectors, provided they have the same number of rows. We can merge as many objects as we want, separating them with commas (,):
cols3 <- cbind(col2,cols2)
cols3
## col2 cols2
## [1,] 82500 364011
## [2,] 110500 371655
## [3,] 55500 487720
## [4,] 28500 474504
## [5,] 85500 436415
## [6,] 38500 457549
## [7,] 39500 469377
## [8,] 124500 397162
## [9,] 41500 434666
## [10,] 49500 478383
## [11,] 329500 488973
## [12,] 77500 394153
## [13,] 36500 426962
## [14,] 148500 362216
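The kind of object returned by cbind() depends on what we feed it. As a rough sketch with hypothetical object names (x, id, m, df), binding two plain vectors gives a matrix, while binding a data.frame with a vector gives a data.frame.

```r
# Two vectors of the same length (first three rows of the table).
x  <- c(82500L, 110500L, 55500L)
id <- c(364011L, 371655L, 487720L)

# Two vectors are bound into a matrix.
m <- cbind(x, id)
is.matrix(m)
dim(m)

# If one of the pieces is a data.frame, the result is a data.frame.
df <- cbind(data.frame(id = id), x)
is.data.frame(df)
```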
2.4.1.3 Changing column names
We often need to change the column names of a table object. To rename all the columns of an object we pass a vector of names to the function colnames() if we are renaming an array, or names() if we are dealing with a data.frame. Note that the vector should have the same length as the total number of columns. Let’s rename the columns of our table cols3:
colnames(cols3)<- c("COL1","COL2")
What if we want to change only one name? Then we just point to the column header using its position index, like this:
colnames(cols3)[2]<- "RENAMED COLUMN"
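The same idea applies to a data.frame through names(). The following sketch uses a made-up data.frame df (a hypothetical name) and also shows how to rename a column by looking up its current name instead of its position.

```r
# A small made-up data.frame.
df <- data.frame(a = 1:3, b = 4:6)

# Rename all columns at once.
names(df) <- c("COL1", "COL2")

# Rename one column by position ...
names(df)[2] <- "RENAMED"

# ... or by looking up its current name.
names(df)[names(df) == "COL1"] <- "FIRST"

names(df)
```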
You may be wondering: how can I know what kind of object my table is? That is a very good question. Especially at the beginning it is quite difficult to keep track of this. If you use RStudio you can already see a description of the objects in the top-right window. array or matrix objects show something like [1:14,1:2], indicating multiple dimensions; vectors are similar but with only 1 dimension, [1:14]; and data.frames show the word data.frame in their description. However, this is not the most rigorous way to deal with object types. Just for the record, what I call type here is what a code developer calls class. Of course there is a function called class() that returns the class an object belongs to:
class(cols3)
## [1] "matrix"
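Depending on the R version, class() on a matrix may return "matrix" alone or both "matrix" and "array", so when the check is done in code it can be safer to use the is.*() helpers. A short sketch with made-up objects (m, df and v are hypothetical names):

```r
# One object of each kind (made-up values).
m  <- cbind(a = 1:3, b = 4:6)          # matrix
df <- data.frame(a = 1:3, b = 4:6)     # data.frame
v  <- 1:3                              # vector

# Yes/no questions about the kind of object.
is.matrix(m)
is.data.frame(df)
is.vector(v)

# A matrix can be converted to a data.frame when needed.
df2 <- as.data.frame(m)
is.data.frame(df2)
```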
2.4.1.4 Sorting our data
Finally, let’s look at how to sort columns and arrays. To sort a column in R we use the sort() function. The general usage of the function is:
sort(cols3[,1])
## [1] 28500 36500 38500 39500 41500 49500 55500 77500
## [9] 82500 85500 110500 124500 148500 329500
If we want to reorder a table based on the values of one of its columns, we use the order() function. The general usage of the function is:
table[order(table$X_INDEX),]
## FID_1 X_INDEX Y_INDEX
## 4 474504 28500 4783500
## 13 426962 36500 4718500
## 6 457549 38500 4757500
## 7 469377 39500 4775500
## 9 434666 41500 4727500
## 10 478383 49500 4789500
## 3 487720 55500 4805500
## 12 394153 77500 4684500
## 1 364011 82500 4653500
## 5 436415 85500 4729500
## 2 371655 110500 4661500
## 8 397162 124500 4687500
## 14 362216 148500 4651500
## 11 488973 329500 4807500
To be honest, in this last example we are actually working with rows. Take a look at the position of the comma (,). The brackets are also something that we will use later to extract data from a table. But it feels right to introduce the order() command right after sort().
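To make the difference between the two functions more visible, the sketch below uses the first four rows of the coordinates table in a data.frame named coords (a hypothetical name). order() returns row positions rather than values, and both functions accept decreasing = TRUE to reverse the direction.

```r
# First four rows of the coordinates table, typed in by hand.
coords <- data.frame(FID_1   = c(364011L, 371655L, 487720L, 474504L),
                     X_INDEX = c(82500L, 110500L, 55500L, 28500L),
                     Y_INDEX = c(4653500L, 4661500L, 4805500L, 4783500L))

# order() gives the row positions that would sort the column.
order(coords$X_INDEX)
## [1] 4 3 1 2

# Those positions go to the left of the comma to reorder the rows,
# here from the largest X_INDEX to the smallest.
coords[order(coords$X_INDEX, decreasing = TRUE), ]

# sort() returns the sorted values themselves.
sort(coords$X_INDEX, decreasing = TRUE)
## [1] 110500  82500  55500  28500
```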
2.4.2 Working with rows
Let us now turn to the manipulation of rows. The procedure is basically the same as in the case of columns, except for the fact that we normally do not work with names assigned to rows (although that’s a possibility), but we refer to a row using its position. To extract rows or combine several objects according to their rows we use the following expressions:
row1 <- table[1:5,]
row2 <- table[-(6:7),]
row3 <- rbind(row1,row2)
As with Arrays, we point to rows instead of columns when we place the index value to the left of the comma in [row,col]. That is all there is to it: we change the position of the index and we are dealing with rows. We can join rows and tables using the rbind() function rather than cbind(); r stands for row and c for column.
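As a small check of how row extraction and rbind() fit together, the sketch below splits a table into two pieces and stacks them again. It assumes a data.frame named coords (a hypothetical name) with the first six rows of the coordinates table; top, rest and both are also names made up for the example.

```r
# First six rows of the coordinates table, typed in by hand.
coords <- data.frame(
  FID_1   = c(364011L, 371655L, 487720L, 474504L, 436415L, 457549L),
  X_INDEX = c(82500L, 110500L, 55500L, 28500L, 85500L, 38500L),
  Y_INDEX = c(4653500L, 4661500L, 4805500L, 4783500L, 4729500L, 4757500L))

# Rows 1 and 2, and everything except rows 1 and 2.
top  <- coords[1:2, ]
rest <- coords[-(1:2), ]

# Stack both pieces; the columns must match.
both <- rbind(top, rest)
nrow(top)
nrow(rest)
nrow(both)
```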
We can extract a subsample of rows that meet a given criterion:
table[criteria,]
table[table$X_INDEX==82500,]
## FID_1 X_INDEX Y_INDEX
## 1 364011 82500 4653500
table[table$X_INDEX>82500,]
## FID_1 X_INDEX Y_INDEX
## 2 371655 110500 4661500
## 5 436415 85500 4729500
## 8 397162 124500 4687500
## 11 488973 329500 4807500
## 14 362216 148500 4651500
Oh, I expect you have found this out by yourself, but evidently we can combine row and column manipulation if that fits our purpose.
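A minimal sketch of such a combination, assuming a data.frame named coords (a hypothetical name) with the first six rows of the coordinates table: two conditions are joined with &, and only two columns are kept. The same selection is then written with subset().

```r
# First six rows of the coordinates table, typed in by hand.
coords <- data.frame(
  FID_1   = c(364011L, 371655L, 487720L, 474504L, 436415L, 457549L),
  X_INDEX = c(82500L, 110500L, 55500L, 28500L, 85500L, 38500L),
  Y_INDEX = c(4653500L, 4661500L, 4805500L, 4783500L, 4729500L, 4757500L))

# Rows to the left of the comma, columns to the right.
res1 <- coords[coords$X_INDEX > 50000 & coords$Y_INDEX < 4700000,
               c("FID_1", "X_INDEX")]
res1

# The same selection written with subset().
res2 <- subset(coords, X_INDEX > 50000 & Y_INDEX < 4700000,
               select = c(FID_1, X_INDEX))
res2
```

Only rows 1 and 2 satisfy both conditions in this reduced table.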