4 Getting started with data in R

4.1 Importing data

Importing data into R to start your analyses: it should be the easiest step. Unfortunately, this is almost never the case. Data come in all sorts of formats, ranging from CSV and text files to statistical software files, databases and HTML data. Knowing which approach to use is key to getting started with the actual analysis. (Quote from DataCamp’s ‘Importing Data in R (Part 1)’ course)

The default location where R looks for and stores files is your working directory (Venables, Smith, and R Core Team 2020). When you open an RStudio project, the working directory is automatically set to the folder containing the .Rproj file.

# what is the current working directory?
getwd()
# which files are currently stored in my working directory?
dir()

To change the working directory, we use the setwd() function.

setwd("C:/Users/JohnDoe/Documents/RandomProject")

If we have stored the data in a subfolder of our current working directory, we can specify the path as follows.

# where are my data files?
pathData <- file.path('data')

Creating a subfolder for your data files is a good way to keep your files and project organized. When learning R, however, it may be easier to use the function file.choose(), which lets you select your file via a point-and-click dialog.
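
As a minimal sketch (wrapped in if (interactive()) since the dialog requires user input; my.path and my.data are just illustrative names), file.choose() can be combined with read.csv() as follows.

if (interactive()) {
  # file.choose() opens a file dialog and returns the selected path
  my.path <- file.choose()
  my.data <- read.csv(my.path)
  str(my.data)
}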

4.1.1 Importing a .csv file with read.csv()

The utils package, which is automatically loaded in your R session on startup, can import CSV files with the read.csv() function. You will now load a data set on swimming pools in Brisbane, Australia (source: data.gov.au). The file contains the column names in the first row and uses commas to separate values within rows (CSV stands for comma-separated values). (Quote and example from DataCamp’s ‘Importing Data in R (Part 1)’ course)

path.pools <- file.path(pathData, "swimming_pools.csv")
pools <- read.csv(path.pools)
str(pools)
'data.frame':   20 obs. of  4 variables:
 $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
 $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...

With the argument stringsAsFactors, you can tell R whether it should convert strings in the flat file to factors. Note that since R 4.0.0 the default value of stringsAsFactors is FALSE, so strings are kept as characters unless you explicitly ask otherwise.

pools <- read.csv(path.pools, stringsAsFactors = FALSE)
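
To see the effect of this argument, here is a quick sketch (pools.fac is just an illustrative name) comparing the class of the Name column under both settings.

# with stringsAsFactors = TRUE the character columns become factors
pools.fac <- read.csv(path.pools, stringsAsFactors = TRUE)
class(pools.fac$Name)
[1] "factor"
class(pools$Name)
[1] "character"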

We can then use the function str() to get a first compact view of the imported data set. This function can be used on any object in R and will help you to understand its structure.

str(pools)
'data.frame':   20 obs. of  4 variables:
 $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
 $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...

4.1.1.1 Importing a .csv file with a semicolon as delimiter

Instead of a comma, it is also possible that the values within a row are separated by semicolons. It is therefore important to know which delimiter is used in your raw data file. This is why the str() function is so useful: it immediately shows you whether the data set was imported correctly or not.

policy.path <- file.path(pathData, "policy.csv")
policy <- read.csv(policy.path)
str(policy)
'data.frame':   39075 obs. of  1 variable:
 $ numeropol.debut_pol.fin_pol.freq_paiement.langue.type_prof.alimentation.type_territoire.utilisation.presence_alarme.marque_voiture.sexe.cout1.cout2.cout3.cout4.nbsin.exposition.cout.age.duree_permis.annee_vehicule: chr  "3;14/09/1995;24/04/1996;mensuel;F;Technicien;Végétarien;Urbain;Travail-quotidien;non;VOLKSWAGEN;F;;;;;0;0.61095"| __truncated__ "3;25/04/1996;23/12/1996;mensuel;F;Technicien;Végétarien;Urbain;Travail-quotidien;non;VOLKSWAGEN;F;;;;;0;0.66301"| __truncated__ "6;1/03/1995;27/02/1996;annuel;A;Technicien;Carnivore;Urbain;Travail-occasionnel;oui;NISSAN;M;279.5838509;;;;1;0"| __truncated__ "6;1/03/1996;14/01/1997;annuel;A;Technicien;Carnivore;Urbain;Travail-occasionnel;oui;NISSAN;M;;;;;0;0.873972603;;43;22;1994" ...
dim(policy)
[1] 39075     1

Here, we clearly see that the data set was not imported correctly. This is because the policy data uses a semicolon as delimiter, while the default delimiter in read.csv() is a comma. To correctly import the data set, we can either use

read.csv2(policy.path)

which has the semicolon as its default delimiter, or we can change the delimiter via the argument sep of the read.csv() function.

read.csv(policy.path, sep = ";")

We can also use the function read.table() to import the data, by specifying header = TRUE and sep = ";".

policy <- read.table(policy.path, header = TRUE, sep = ";")

Once we have imported the data, we can further explore it using the following functions.

head(policy)
  numeropol  debut_pol    fin_pol freq_paiement langue  type_prof alimentation
1         3 14/09/1995 24/04/1996       mensuel      F Technicien   Végétarien
2         3 25/04/1996 23/12/1996       mensuel      F Technicien   Végétarien
3         6  1/03/1995 27/02/1996        annuel      A Technicien    Carnivore
4         6  1/03/1996 14/01/1997        annuel      A Technicien    Carnivore
5         6 15/01/1997 31/01/1997        annuel      A Technicien    Carnivore
6         6  1/02/1997 28/02/1997        annuel      A Technicien    Carnivore
  type_territoire         utilisation presence_alarme marque_voiture sexe cout1
1          Urbain   Travail-quotidien             non     VOLKSWAGEN    F    NA
2          Urbain   Travail-quotidien             non     VOLKSWAGEN    F    NA
3          Urbain Travail-occasionnel             oui         NISSAN    M 279.6
4          Urbain Travail-occasionnel             oui         NISSAN    M    NA
5          Urbain Travail-occasionnel             oui         NISSAN    M    NA
6          Urbain Travail-occasionnel             oui         NISSAN    M    NA
  cout2 cout3 cout4 nbsin exposition  cout age duree_permis annee_vehicule
1    NA    NA    NA     0    0.61096    NA  29           10           1989
2    NA    NA    NA     0    0.66301    NA  30           11           1989
3    NA    NA    NA     1    0.99452 279.6  42           21           1994
4    NA    NA    NA     0    0.87397    NA  43           22           1994
5    NA    NA    NA     0    0.04384    NA  44           23           1994
6    NA    NA    NA     0    0.07397    NA  44           23           1994
tail(policy)
      numeropol  debut_pol    fin_pol freq_paiement langue  type_prof
39070     88942 31/03/2003 30/03/2004       mensuel      A   Actuaire
39071     88945 21/03/2003 20/03/2004        annuel      A Technicien
39072     88972 18/03/2003 17/03/2004       mensuel      F Technicien
39073     88981 19/03/2003 18/03/2004       mensuel      A Technicien
39074     88986 28/02/2004 27/02/2005       mensuel      A    Médecin
39075     89009 24/03/2003 23/03/2004       mensuel      A Technicien
      alimentation type_territoire         utilisation presence_alarme
39070    Carnivore          Urbain Travail-occasionnel             oui
39071   Végétalien          Urbain Travail-occasionnel             oui
39072   Végétarien     Semi-urbain   Travail-quotidien             non
39073   Végétalien          Urbain Travail-occasionnel             oui
39074    Carnivore          Urbain   Travail-quotidien             oui
39075   Végétarien          Urbain Travail-occasionnel             non
      marque_voiture sexe cout1 cout2 cout3 cout4 nbsin exposition cout age
39070            BMW    M    NA    NA    NA    NA     0          1   NA  45
39071         TOYOTA    M    NA    NA    NA    NA     0          1   NA  24
39072        PEUGEOT    M    NA    NA    NA    NA     0          1   NA  58
39073         SUZUKI    F    NA    NA    NA    NA     0          1   NA  23
39074           FIAT    M    NA    NA    NA    NA     0          1   NA  41
39075         TOYOTA    M    NA    NA    NA    NA     0          1   NA  58
      duree_permis annee_vehicule
39070           30           1989
39071            5           2000
39072           33           2003
39073            5           1998
39074           19           1989
39075           37           2003
names(policy)
 [1] "numeropol"       "debut_pol"       "fin_pol"         "freq_paiement"  
 [5] "langue"          "type_prof"       "alimentation"    "type_territoire"
 [9] "utilisation"     "presence_alarme" "marque_voiture"  "sexe"           
[13] "cout1"           "cout2"           "cout3"           "cout4"          
[17] "nbsin"           "exposition"      "cout"            "age"            
[21] "duree_permis"    "annee_vehicule" 
dim(policy)
[1] 39075    22

For the purposes of this chapter, we will write a custom function to explore a data.frame. For now, it is not necessary to understand it, as we will discuss functions in detail in Chapter 8.

ExploreDf <- function(x) {
        if (!is.data.frame(x))
                stop("Only objects of type data.frame are allowed.")
        # capture the printed output of str() as a character vector
        StrAdj  <- function(x) capture.output(str(x))
        # the exploration functions to apply, stored in a named list
        fntions <- setNames(list(names, dim, head, tail, StrAdj),
                            c("names", "dim", "head", "tail", "str"))
        # apply each function to the data frame and keep the results
        Res <- sapply(fntions, function(f) f(x), simplify = FALSE)
        for (i in seq_along(Res)) {
                cat("\n\n", names(Res)[i], ":\n")
                if (i %in% 3:4)          # head and tail
                        print.data.frame(Res[[i]])
                else if (i != 5)         # names and dim
                        print(Res[[i]])
                else                     # str: print the captured lines
                        cat(paste0(Res[[5]], collapse = "\n"))
        }
}

4.1.2 Importing a .txt file: the Danish fire insurance data

read.table() is the most basic function to import data sets. It is in fact the function that is used under the hood by read.csv() and read.csv2(); only the default arguments differ. You can easily see this by looking at the functions themselves.

read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
    fill = TRUE, comment.char = "", ...) 
read.table(file = file, header = header, sep = sep, quote = quote, 
    dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x0000020c8468db30>
<environment: namespace:utils>
read.csv2
function (file, header = TRUE, sep = ";", quote = "\"", dec = ",", 
    fill = TRUE, comment.char = "", ...) 
read.table(file = file, header = header, sep = sep, quote = quote, 
    dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x0000020cffc99c98>
<environment: namespace:utils>

An important difference with the read.csv() function is that in read.table() the header argument defaults to FALSE and the sep argument is "" by default. (Quote from DataCamp’s ‘Importing Data in R (Part 1)’ course)

header is an argument that takes a logical value (i.e. TRUE or FALSE) and indicates whether the file contains the variable names as its first line. In almost all cases, you will therefore have to set this to TRUE.

path.fire <- file.path(pathData, "danish.txt")
danish <- read.table(path.fire, header = TRUE)
head(danish, n = 10) # use the argument 'n' to display less/more records
         Date Loss.in.DKM
1  01/03/1980       1.684
2  01/04/1980       2.094
3  01/05/1980       1.733
4  01/07/1980       1.780
5  01/07/1980       4.612
6  01/10/1980       8.725
7  01/10/1980       7.899
8  01/16/1980       2.208
9  01/16/1980       1.486
10 01/19/1980       2.796
ExploreDf(danish)


 names :
[1] "Date"        "Loss.in.DKM"


 dim :
[1] 2167    2


 head :
        Date Loss.in.DKM
1 01/03/1980       1.684
2 01/04/1980       2.094
3 01/05/1980       1.733
4 01/07/1980       1.780
5 01/07/1980       4.612
6 01/10/1980       8.725


 tail :
           Date Loss.in.DKM
2162 12/24/1990       1.238
2163 12/27/1990       1.115
2164 12/30/1990       1.403
2165 12/30/1990       4.868
2166 12/30/1990       1.073
2167 12/31/1990       4.125


 str :
'data.frame':   2167 obs. of  2 variables:
 $ Date       : chr  "01/03/1980" "01/04/1980" "01/05/1980" "01/07/1980" ...
 $ Loss.in.DKM: num  1.68 2.09 1.73 1.78 4.61 ...

Compared to the raw .txt file, something is a bit different in the imported dataset. Can you see what’s different? That’s right, the second column name has dots instead of hyphens.

# Function to open files or folders from within R
opendir <- function(dir = getwd()){
  if (.Platform$OS.type == "windows"){
    # on Windows, let the shell open the path with the default program
    shell.exec(dir)
  } else {
    # on other systems, fall back on the browser configured for R
    system(paste(Sys.getenv("R_BROWSER"), dir))
  }
}
# open the file
opendir(file.path(getwd(), path.fire))

When importing a data file with headers, R checks whether the column names are syntactically valid (i.e. valid as R names). If not, they are adjusted to make sure that they are usable in R. One way to avoid this kind of behavior is to set the check.names argument to FALSE.

danish <- read.table(path.fire, header = TRUE, check.names = FALSE)
names(danish)
[1] "Date"        "Loss-in-DKM"

Note, however, that you will now have to select such columns using backticks (e.g. danish$`Loss-in-DKM`).

# The following does not work
# danish$Loss-in-DKM
# This does work
head(danish$`Loss-in-DKM`)
[1] 1.684 2.094 1.733 1.780 4.612 8.725
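
For completeness: the automatic adjustment you saw earlier is carried out by make.names(), which you can also call directly to check how a given name will be modified.

make.names("Loss-in-DKM")
[1] "Loss.in.DKM"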

You can also explicitly specify the column names as well as the column types/classes of the resulting data frame, by setting the col.names and the colClasses arguments to a vector of strings. (Quote from DataCamp’s ‘Importing Data in R (Part 1)’ course)

path.hotdogs <- file.path(pathData, "hotdogs.txt")
hotdogs <- read.table(path.hotdogs, header = FALSE, col.names = c("type", "calories", "sodium"))
# display structure of hotdogs
str(hotdogs)
'data.frame':   54 obs. of  3 variables:
 $ type    : chr  "Beef" "Beef" "Beef" "Beef" ...
 $ calories: int  186 181 176 149 184 190 158 139 175 148 ...
 $ sodium  : int  495 477 425 322 482 587 370 322 479 375 ...
# edit the colClasses argument to import the data correctly: hotdogs2
hotdogs2 <- read.table(path.hotdogs, header = FALSE, 
                       col.names = c("type", "calories", "sodium"),
                       colClasses = c("factor", "NULL", "numeric"))
# display structure of hotdogs2
str(hotdogs2)
'data.frame':   54 obs. of  2 variables:
 $ type  : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sodium: num  495 477 425 322 482 587 370 322 479 375 ...

What happened? What is the effect of specifying one of the colClasses as NULL? Exactly: setting an entry of colClasses to "NULL" skips that column altogether, which is why calories is no longer part of hotdogs2.

Another very useful function to explore your data.frame is the summary() function. This function can be used on a whole range of objects and provides you with a short summary of the object (as you might have expected).

summary(danish)
     Date            Loss-in-DKM    
 Length:2167        Min.   :  1.00  
 Class :character   1st Qu.:  1.32  
 Mode  :character   Median :  1.78  
                    Mean   :  3.39  
                    3rd Qu.:  2.97  
                    Max.   :263.25  

Inspecting the data.frame, we see that the class of the variable Date is not correct. One way to convert this variable to an object with class Date is by using the as.Date() function.

danish$Date <- as.Date(danish$Date, format = "%m/%d/%Y")
str(danish)
'data.frame':   2167 obs. of  2 variables:
 $ Date       : Date, format: "1980-01-03" "1980-01-04" ...
 $ Loss-in-DKM: num  1.68 2.09 1.73 1.78 4.61 ...

To understand why format = "%m/%d/%Y", you can check the help page of strptime(). Briefly summarized: %m indicates that the month is given as a decimal number, %d that the day of the month is given as a decimal number, and %Y that the year with century is given; the three are separated by /, as this is also the delimiter used in the Date column.

?strptime
# the %b format (abbreviated month name) is locale-dependent, so we
# temporarily switch to an English locale before parsing "Apr"
OrigLocal <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "English_United States.1252")
[1] "English_United States.1252"
as.Date("16-Apr-1963", format = "%d-%b-%Y")
[1] "1963-04-16"
Sys.setlocale("LC_TIME", OrigLocal) # restore the original locale
[1] "English_United States.1252"

Or you can try to fix this directly when importing the danish.txt.

path.fire <- file.path(pathData, "danish.txt")
danish <- read.table(path.fire, header = TRUE, colClasses = c("Date", "numeric"))
head(danish$Date)
[1] "0001-03-19" "0001-04-19" "0001-05-19" "0001-07-19" "0001-07-19"
[6] "0001-10-19"

Setting colClasses to "Date", however, only works when the format is either "%Y-%m-%d" or "%Y/%m/%d" (see ?as.Date). We therefore have to put in some extra effort and create a custom class with its own coercion method to correctly import the Date variable.

setClass("myDate")
setAs("character", "myDate", function(from) as.Date(from, format = "%m/%d/%Y"))
danish2 <- read.table(path.fire, header = TRUE, colClasses = c("myDate", "numeric"))
str(danish2)
'data.frame':   2167 obs. of  2 variables:
 $ Date       : Date, format: "1980-01-03" "1980-01-04" ...
 $ Loss.in.DKM: num  1.68 2.09 1.73 1.78 4.61 ...

4.1.3 Importing a .sas7bdat file

When you decided to make the switch from SAS to R, you obviously made the right choice. SAS has been around for a very long time, but as the fortunes below attest, statisticians have been happily trading it in since at least 1991.

if(!"fortunes" %in% rownames(installed.packages()))
        install.packages("fortunes")
fortunes::fortune(22)

I quit using SAS in 1991 because my productivity jumped at least 20% within one
month of using S-Plus.
   -- Frank Harrell
      R-help (November 2003)
fortunes::fortune(84)

There are companies whose yearly license fees to SAS total millions of dollars.
Then those companies hire armies of SAS programmers to program an archaic macro
language using old statistical methods to produce ugly tables and the worst
graphics in the statistical software world.
   -- Frank Harrell
      R-help (November 2004)
fortunes::fortune(224)

It's interesting that SAS Institute feels that non-peer-reviewed software with
hidden implementations of analytic methods that cannot be reproduced by others
should be trusted when building aircraft engines.
   -- Frank Harrell (in response to the statement of the SAS director of
      technology product marketing: "We have customers who build engines for
      aircraft. I am happy they are not using freeware when I get on a jet.")
      R-help (January 2009)

To import a SAS file that has a sas7bdat format, we use the read.sas7bdat() function from the sas7bdat package.

if(!"sas7bdat" %in% rownames(installed.packages()))
        install.packages("sas7bdat")
library(sas7bdat)
path.severity <- file.path(pathData, "severity.sas7bdat")
severity <- read.sas7bdat(path.severity)
ExploreDf(severity)


 names :
[1] "policyId"    "claimId"     "rc"          "deductible"  "claimAmount"


 dim :
[1] 19287     5


 head :
  policyId claimId    rc deductible claimAmount
1    6e+05   9e+05 35306       1200       35306
2    6e+05   9e+05 19773         50       19773
3    6e+05   9e+05 41639        100       41639
4    6e+05   9e+05 10649         50       10649
5    6e+05   9e+05 20479         50       20479
6    6e+05   9e+05  9853         50        9853


 tail :
      policyId claimId   rc deductible claimAmount
19282   612853  919300  NaN         50         151
19283   612854  919301 1587        300        1587
19284   612854  919302  NaN        300         574
19285   612855  919303  NaN         50         323
19286   612856  919304 1287       1200        1287
19287   612857  919305 1910       1200        1910


 str :
'data.frame':   19287 obs. of  5 variables:
 $ policyId   : num  6e+05 6e+05 6e+05 6e+05 6e+05 ...
 $ claimId    : num  9e+05 9e+05 9e+05 9e+05 9e+05 ...
 $ rc         : num  35306 19773 41639 10649 20479 ...
 $ deductible : num  1200 50 100 50 50 50 50 50 50 50 ...
 $ claimAmount: num  35306 19773 41639 10649 20479 ...
 - attr(*, "pkg.version")= chr "0.5"
 - attr(*, "column.info")=List of 5
  ..$ :List of 11
  .. ..$ name  : chr "policyId"
  .. ..$ offset: int 0
  .. ..$ length: int 8
  .. ..$ type  : chr "numeric"
  .. ..$ format: chr "BEST"
  .. ..$ fhdr  : int 0
  .. ..$ foff  : int 44
  .. ..$ flen  : int 4
  .. ..$ lhdr  : int 0
  .. ..$ loff  : int 0
  .. ..$ llen  : int 0
  ..$ :List of 11
  .. ..$ name  : chr "claimId"
  .. ..$ offset: int 8
  .. ..$ length: int 8
  .. ..$ type  : chr "numeric"
  .. ..$ format: chr "BEST"
  .. ..$ fhdr  : int 0
  .. ..$ foff  : int 56
  .. ..$ flen  : int 4
  .. ..$ lhdr  : int 0
  .. ..$ loff  : int 0
  .. ..$ llen  : int 0
  ..$ :List of 11
  .. ..$ name  : chr "rc"
  .. ..$ offset: int 16
  .. ..$ length: int 8
  .. ..$ type  : chr "numeric"
  .. ..$ format: chr "BEST"
  .. ..$ fhdr  : int 0
  .. ..$ foff  : int 64
  .. ..$ flen  : int 4
  .. ..$ lhdr  : int 0
  .. ..$ loff  : int 0
  .. ..$ llen  : int 0
  ..$ :List of 11
  .. ..$ name  : chr "deductible"
  .. ..$ offset: int 24
  .. ..$ length: int 8
  .. ..$ type  : chr "numeric"
  .. ..$ format: chr "BEST"
  .. ..$ fhdr  : int 0
  .. ..$ foff  : int 80
  .. ..$ flen  : int 4
  .. ..$ lhdr  : int 0
  .. ..$ loff  : int 0
  .. ..$ llen  : int 0
  ..$ :List of 11
  .. ..$ name  : chr "claimAmount"
  .. ..$ offset: int 32
  .. ..$ length: int 8
  .. ..$ type  : chr "numeric"
  .. ..$ format: chr "COMMA"
  .. ..$ fhdr  : int 0
  .. ..$ foff  : int 96
  .. ..$ flen  : int 5
  .. ..$ lhdr  : int 0
  .. ..$ loff  : int 0
  .. ..$ llen  : int 0
 - attr(*, "date.created")= POSIXct[1:1], format: "1960-01-01"
 - attr(*, "date.modified")= POSIXct[1:1], format: "1960-01-01"
 - attr(*, "SAS.release")= chr "9.0401M1"
 - attr(*, "SAS.host")= chr "X64_7PRO"
 - attr(*, "OS.version")= chr ""
 - attr(*, "OS.maker")= chr ""
 - attr(*, "OS.name")= chr ""
 - attr(*, "endian")= chr "little"
 - attr(*, "winunix")= chr "windows"

4.1.4 Importing an .xlsx file

You will import data from Excel using the readxl package (authored by Hadley Wickham and maintained by RStudio).

Before you can start importing from Excel, you should find out which sheets are available in the workbook. You can use the excel_sheets() function for this. (Quote and example from DataCamp’s ‘Importing Data in R (Part 1)’ course)

# load the readxl package
library(readxl)
path.urbanpop <- file.path(pathData, "urbanpop.xlsx")
excel_sheets(path.urbanpop)
[1] "1960-1966" "1967-1974" "1975-2011"

You can import the Excel file with the read_excel() function. Here is the recipe:

pop_1 <- read_excel(path.urbanpop, sheet = 1)
pop_2 <- read_excel(path.urbanpop, sheet = 2)
pop_3 <- read_excel(path.urbanpop, sheet = 3)
str(pop_1)
tibble [209 x 8] (S3: tbl_df/tbl/data.frame)
 $ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
 $ 1960   : num [1:209] 769308 494443 3293999 NA NA ...
 $ 1961   : num [1:209] 814923 511803 3515148 13660 8724 ...
 $ 1962   : num [1:209] 858522 529439 3739963 14166 9700 ...
 $ 1963   : num [1:209] 903914 547377 3973289 14759 10748 ...
 $ 1964   : num [1:209] 951226 565572 4220987 15396 11866 ...
 $ 1965   : num [1:209] 1000582 583983 4488176 16045 13053 ...
 $ 1966   : num [1:209] 1058743 602512 4649105 16693 14217 ...
# put pop_1, pop_2 and pop_3 in a list: pop_list
pop_list <- list(pop_1, pop_2, pop_3)

The object pop_1 is a tibble, an object of class tbl_df, which provides stricter checking and better formatting than the traditional data frame. The main advantage of using a tbl_df over a regular data frame is the printing: tibbles only print the first few rows and all the columns that fit on one screen, describing the rest in text. If you want to stick to traditional data frames, you can convert a tibble using the as.data.frame() function.
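
You can see this compact printing by simply typing the object’s name in the console.

pop_1 # prints the first 10 rows and only the columns that fit on screen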

pop_1_df <- as.data.frame(pop_1)
str(pop_1_df)
'data.frame':   209 obs. of  8 variables:
 $ country: chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
 $ 1960   : num  769308 494443 3293999 NA NA ...
 $ 1961   : num  814923 511803 3515148 13660 8724 ...
 $ 1962   : num  858522 529439 3739963 14166 9700 ...
 $ 1963   : num  903914 547377 3973289 14759 10748 ...
 $ 1964   : num  951226 565572 4220987 15396 11866 ...
 $ 1965   : num  1000582 583983 4488176 16045 13053 ...
 $ 1966   : num  1058743 602512 4649105 16693 14217 ...

In the previous demo you generated a list of three Excel sheets that you imported. However, loading in every sheet manually and then merging them in a list can be quite tedious. Luckily, you can automate this with lapply(). (Quote from DataCamp’s ‘Importing Data in R (Part 1)’ course)

pop_list <- lapply(excel_sheets(path.urbanpop), read_excel, path = path.urbanpop)
str(pop_list)
List of 3
 $ : tibble [209 x 8] (S3: tbl_df/tbl/data.frame)
  ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
  ..$ 1960   : num [1:209] 769308 494443 3293999 NA NA ...
  ..$ 1961   : num [1:209] 814923 511803 3515148 13660 8724 ...
  ..$ 1962   : num [1:209] 858522 529439 3739963 14166 9700 ...
  ..$ 1963   : num [1:209] 903914 547377 3973289 14759 10748 ...
  ..$ 1964   : num [1:209] 951226 565572 4220987 15396 11866 ...
  ..$ 1965   : num [1:209] 1000582 583983 4488176 16045 13053 ...
  ..$ 1966   : num [1:209] 1058743 602512 4649105 16693 14217 ...
 $ : tibble [209 x 9] (S3: tbl_df/tbl/data.frame)
  ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
  ..$ 1967   : num [1:209] 1119067 621180 4826104 17349 15440 ...
  ..$ 1968   : num [1:209] 1182159 639964 5017299 17996 16727 ...
  ..$ 1969   : num [1:209] 1248901 658853 5219332 18619 18088 ...
  ..$ 1970   : num [1:209] 1319849 677839 5429743 19206 19529 ...
  ..$ 1971   : num [1:209] 1409001 698932 5619042 19752 20929 ...
  ..$ 1972   : num [1:209] 1502402 720207 5815734 20263 22406 ...
  ..$ 1973   : num [1:209] 1598835 741681 6020647 20742 23937 ...
  ..$ 1974   : num [1:209] 1696445 763385 6235114 21194 25482 ...
 $ : tibble [209 x 38] (S3: tbl_df/tbl/data.frame)
  ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
  ..$ 1975   : num [1:209] 1793266 785350 6460138 21632 27019 ...
  ..$ 1976   : num [1:209] 1905033 807990 6774099 22047 28366 ...
  ..$ 1977   : num [1:209] 2021308 830959 7102902 22452 29677 ...
  ..$ 1978   : num [1:209] 2142248 854262 7447728 22899 31037 ...
  ..$ 1979   : num [1:209] 2268015 877898 7810073 23457 32572 ...
  ..$ 1980   : num [1:209] 2398775 901884 8190772 24177 34366 ...
  ..$ 1981   : num [1:209] 2493265 927224 8637724 25173 36356 ...
  ..$ 1982   : num [1:209] 2590846 952447 9105820 26342 38618 ...
  ..$ 1983   : num [1:209] 2691612 978476 9591900 27655 40983 ...
  ..$ 1984   : num [1:209] 2795656 1006613 10091289 29062 43207 ...
  ..$ 1985   : num [1:209] 2903078 1037541 10600112 30524 45119 ...
  ..$ 1986   : num [1:209] 3006983 1072365 11101757 32014 46254 ...
  ..$ 1987   : num [1:209] 3113957 1109954 11609104 33548 47019 ...
  ..$ 1988   : num [1:209] 3224082 1146633 12122941 35095 47669 ...
  ..$ 1989   : num [1:209] 3337444 1177286 12645263 36618 48577 ...
  ..$ 1990   : num [1:209] 3454129 1198293 13177079 38088 49982 ...
  ..$ 1991   : num [1:209] 3617842 1215445 13708813 39600 51972 ...
  ..$ 1992   : num [1:209] 3788685 1222544 14248297 41049 54469 ...
  ..$ 1993   : num [1:209] 3966956 1222812 14789176 42443 57079 ...
  ..$ 1994   : num [1:209] 4152960 1221364 15322651 43798 59243 ...
  ..$ 1995   : num [1:209] 4347018 1222234 15842442 45129 60598 ...
  ..$ 1996   : num [1:209] 4531285 1228760 16395553 46343 60927 ...
  ..$ 1997   : num [1:209] 4722603 1238090 16935451 47527 60462 ...
  ..$ 1998   : num [1:209] 4921227 1250366 17469200 48705 59685 ...
  ..$ 1999   : num [1:209] 5127421 1265195 18007937 49906 59281 ...
  ..$ 2000   : num [1:209] 5341456 1282223 18560597 51151 59719 ...
  ..$ 2001   : num [1:209] 5564492 1315690 19198872 52341 61062 ...
  ..$ 2002   : num [1:209] 5795940 1352278 19854835 53583 63212 ...
  ..$ 2003   : num [1:209] 6036100 1391143 20529356 54864 65802 ...
  ..$ 2004   : num [1:209] 6285281 1430918 21222198 56166 68301 ...
  ..$ 2005   : num [1:209] 6543804 1470488 21932978 57474 70329 ...
  ..$ 2006   : num [1:209] 6812538 1512255 22625052 58679 71726 ...
  ..$ 2007   : num [1:209] 7091245 1553491 23335543 59894 72684 ...
  ..$ 2008   : num [1:209] 7380272 1594351 24061749 61118 73335 ...
  ..$ 2009   : num [1:209] 7679982 1635262 24799591 62357 73897 ...
  ..$ 2010   : num [1:209] 7990746 1676545 25545622 63616 74525 ...
  ..$ 2011   : num [1:209] 8316976 1716842 26216968 64817 75207 ...

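As a small extension of this recipe (our own addition, not part of the original course), you can name the list elements after the sheet names, so that each sheet can be accessed by name:

pop_list <- setNames(pop_list, excel_sheets(path.urbanpop))
names(pop_list)
[1] "1960-1966" "1967-1974" "1975-2011"
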
Apart from path and sheet, there are several other arguments you can specify in read_excel(). One of these arguments is called col_names. By default it is TRUE, meaning that the first row in the Excel sheet contains the column names. If this is not the case, you can set col_names to FALSE, in which case R will choose column names for you. You can also choose to set col_names to a character vector with names for each column. (Quote from DataCamp’s ‘Importing Data in R (Part 1)’ course)

path.urbanpop_nonames <- file.path(pathData, "urbanpop_nonames.xlsx")
# Import the first Excel sheet of urbanpop_nonames.xlsx (R gives names): pop_a
pop_a <- read_excel(path.urbanpop_nonames, col_names = FALSE)
New names:
* `` -> ...1
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* ...
# Import the first Excel sheet of urbanpop_nonames.xlsx (specify col_names): pop_b
cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel(path.urbanpop_nonames, col_names = cols)
# Print the summary of pop_a
summary(pop_a)
     ...1                ...2               ...3               ...4         
 Length:209         Min.   :3.38e+03   Min.   :1.03e+03   Min.   :1.09e+03  
 Class :character   1st Qu.:8.90e+04   1st Qu.:7.06e+04   1st Qu.:7.50e+04  
 Mode  :character   Median :5.81e+05   Median :5.70e+05   Median :5.94e+05  
                    Mean   :4.99e+06   Mean   :4.99e+06   Mean   :5.14e+06  
                    3rd Qu.:3.08e+06   3rd Qu.:2.81e+06   3rd Qu.:2.95e+06  
                    Max.   :1.26e+08   Max.   :1.29e+08   Max.   :1.32e+08  
                    NA's   :11                                              
      ...5               ...6               ...7               ...8         
 Min.   :1.15e+03   Min.   :1.22e+03   Min.   :1.28e+03   Min.   :1.35e+03  
 1st Qu.:8.19e+04   1st Qu.:8.50e+04   1st Qu.:8.86e+04   1st Qu.:9.36e+04  
 Median :6.19e+05   Median :6.45e+05   Median :6.79e+05   Median :7.35e+05  
 Mean   :5.30e+06   Mean   :5.47e+06   Mean   :5.64e+06   Mean   :5.79e+06  
 3rd Qu.:3.15e+06   3rd Qu.:3.30e+06   3rd Qu.:3.32e+06   3rd Qu.:3.42e+06  
 Max.   :1.35e+08   Max.   :1.37e+08   Max.   :1.40e+08   Max.   :1.42e+08  
                                                                            
# Print the summary of pop_b
summary(pop_b)
   country            year_1960          year_1961          year_1962       
 Length:209         Min.   :3.38e+03   Min.   :1.03e+03   Min.   :1.09e+03  
 Class :character   1st Qu.:8.90e+04   1st Qu.:7.06e+04   1st Qu.:7.50e+04  
 Mode  :character   Median :5.81e+05   Median :5.70e+05   Median :5.94e+05  
                    Mean   :4.99e+06   Mean   :4.99e+06   Mean   :5.14e+06  
                    3rd Qu.:3.08e+06   3rd Qu.:2.81e+06   3rd Qu.:2.95e+06  
                    Max.   :1.26e+08   Max.   :1.29e+08   Max.   :1.32e+08  
                    NA's   :11                                              
   year_1963          year_1964          year_1965          year_1966       
 Min.   :1.15e+03   Min.   :1.22e+03   Min.   :1.28e+03   Min.   :1.35e+03  
 1st Qu.:8.19e+04   1st Qu.:8.50e+04   1st Qu.:8.86e+04   1st Qu.:9.36e+04  
 Median :6.19e+05   Median :6.45e+05   Median :6.79e+05   Median :7.35e+05  
 Mean   :5.30e+06   Mean   :5.47e+06   Mean   :5.64e+06   Mean   :5.79e+06  
 3rd Qu.:3.15e+06   3rd Qu.:3.30e+06   3rd Qu.:3.32e+06   3rd Qu.:3.42e+06  
 Max.   :1.35e+08   Max.   :1.37e+08   Max.   :1.40e+08   Max.   :1.42e+08  
                                                                            

In the code printed above, paste0 (and also paste) converts its arguments (via as.character) to character strings and concatenates them (in the case of paste separating them by the string given by the argument sep, which is a single space by default).
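
A few quick illustrations of the difference (using a shortened year range):

paste0("year_", 1960:1962)
[1] "year_1960" "year_1961" "year_1962"
paste("year", 1960:1962)
[1] "year 1960" "year 1961" "year 1962"
paste("year", 1960:1962, sep = "_")
[1] "year_1960" "year_1961" "year_1962"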

Many other packages exist to import Excel data, including XLConnect, an Excel Connector for R that provides comprehensive functionality to read, write and format Excel data. See DataCamp’s Importing Data in R (Part 1) course.


4.2 Basic data handling steps

You will now learn some basic functions to handle data in R. You start with basic instructions from base R for data handling; more on data wrangling follows in Chapter 6. Useful functions from base R are subset(), sort(), order(), merge(), cbind() and rbind(). Manipulating data typically consumes a lot of effort in the beginning, but it becomes second nature once you get the hang of it.

Manipulating data often requires repeated operations on different sections of the data, in a ‘split-apply-combine’ way of working. Let’s illustrate all of this below. Some of the examples that follow are taken from Michael Clark’s ‘An Introduction to R’.

4.2.1 Subsetting

The data set state.x77 is available from the package datasets. This package contains a variety of data sets, some of which contain information on all 50 states of the United States of America. state.x77, for example, is a matrix with 50 rows and 8 columns, containing a wealth of information on the 50 states.

?state.x77
states <- data.frame(state.x77)
ExploreDf(states)


 names :
[1] "Population" "Income"     "Illiteracy" "Life.Exp"   "Murder"    
[6] "HS.Grad"    "Frost"      "Area"      


 dim :
[1] 50  8


 head :
           Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766


 tail :
              Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
Vermont              472   3907        0.6    71.64    5.5    57.1   168  9267
Virginia            4981   4701        1.4    70.08    9.5    47.8    85 39780
Washington          3559   4864        0.6    71.72    4.3    63.5    32 66570
West Virginia       1799   3617        1.4    69.48    6.7    41.6   100 24070
Wisconsin           4589   4468        0.7    72.48    3.0    54.5   149 54464
Wyoming              376   4566        0.6    70.29    6.9    62.9   173 97203


 str :
'data.frame':   50 obs. of  8 variables:
 $ Population: num  3615 365 2212 2110 21198 ...
 $ Income    : num  3624 6315 4530 3378 5114 ...
 $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
 $ Life.Exp  : num  69 69.3 70.5 70.7 71.7 ...
 $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
 $ HS.Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
 $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
 $ Area      : num  50708 566432 113417 51945 156361 ...
states[14, ]
        Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
Indiana       5313   4458        0.7    70.88    7.1    52.9   122 36097
states[3, 6, drop = F] # use drop = FALSE to keep it as a data.frame
        HS.Grad
Arizona    58.1
states[, 'Frost'] 
 [1]  20 152  15  65  20 166 139 103  11  60   0 126 127 122 140 114  95  12 161
[20] 101 103 125 160  50 108 155 139 188 174 115 120  82  80 186 124  82  44 126
[39] 127  65 172  70  35 137 168  85  32 100 149 173
states$Frost
 [1]  20 152  15  65  20 166 139 103  11  60   0 126 127 122 140 114  95  12 161
[20] 101 103 125 160  50 108 155 139 188 174 115 120  82  80 186 124  82  44 126
[39] 127  65 172  70  35 137 168  85  32 100 149 173

You will also use the data stored in state.region, a factor giving the region (Northeast, South, North Central, West) that each state belongs to.

state.region
 [1] South         West          West          South         West         
 [6] West          Northeast     South         South         South        
[11] West          West          North Central North Central North Central
[16] North Central South         South         Northeast     South        
[21] Northeast     North Central North Central South         North Central
[26] West          North Central West          Northeast     Northeast    
[31] West          Northeast     South         North Central North Central
[36] South         West          Northeast     Northeast     South        
[41] North Central South         South         West          Northeast    
[46] South         West          South         North Central West         
Levels: Northeast South North Central West
length(state.region)
[1] 50
# select those states that are in the south of the US 
mysubset <- subset(states, state.region == "South")
# subset a selection of variables
str(states)
'data.frame':   50 obs. of  8 variables:
 $ Population: num  3615 365 2212 2110 21198 ...
 $ Income    : num  3624 6315 4530 3378 5114 ...
 $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
 $ Life.Exp  : num  69 69.3 70.5 70.7 71.7 ...
 $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
 $ HS.Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
 $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
 $ Area      : num  50708 566432 113417 51945 156361 ...
(mysubset <- states[, c(1:2, 7:8)])
               Population Income Frost   Area
Alabama              3615   3624    20  50708
Alaska                365   6315   152 566432
Arizona              2212   4530    15 113417
Arkansas             2110   3378    65  51945
California          21198   5114    20 156361
Colorado             2541   4884   166 103766
Connecticut          3100   5348   139   4862
Delaware              579   4809   103   1982
Florida              8277   4815    11  54090
Georgia              4931   4091    60  58073
Hawaii                868   4963     0   6425
Idaho                 813   4119   126  82677
Illinois            11197   5107   127  55748
Indiana              5313   4458   122  36097
Iowa                 2861   4628   140  55941
Kansas               2280   4669   114  81787
Kentucky             3387   3712    95  39650
Louisiana            3806   3545    12  44930
Maine                1058   3694   161  30920
Maryland             4122   5299   101   9891
Massachusetts        5814   4755   103   7826
Michigan             9111   4751   125  56817
Minnesota            3921   4675   160  79289
Mississippi          2341   3098    50  47296
Missouri             4767   4254   108  68995
Montana               746   4347   155 145587
Nebraska             1544   4508   139  76483
Nevada                590   5149   188 109889
New Hampshire         812   4281   174   9027
New Jersey           7333   5237   115   7521
New Mexico           1144   3601   120 121412
New York            18076   4903    82  47831
North Carolina       5441   3875    80  48798
North Dakota          637   5087   186  69273
Ohio                10735   4561   124  40975
Oklahoma             2715   3983    82  68782
Oregon               2284   4660    44  96184
Pennsylvania        11860   4449   126  44966
Rhode Island          931   4558   127   1049
South Carolina       2816   3635    65  30225
South Dakota          681   4167   172  75955
Tennessee            4173   3821    70  41328
Texas               12237   4188    35 262134
Utah                 1203   4022   137  82096
Vermont               472   3907   168   9267
Virginia             4981   4701    85  39780
Washington           3559   4864    32  66570
West Virginia        1799   3617   100  24070
Wisconsin            4589   4468   149  54464
Wyoming               376   4566   173  97203
(mysubset <- states[, c("Population", "Income", "Frost", "Area")])
               Population Income Frost   Area
Alabama              3615   3624    20  50708
Alaska                365   6315   152 566432
Arizona              2212   4530    15 113417
Arkansas             2110   3378    65  51945
California          21198   5114    20 156361
Colorado             2541   4884   166 103766
Connecticut          3100   5348   139   4862
Delaware              579   4809   103   1982
Florida              8277   4815    11  54090
Georgia              4931   4091    60  58073
Hawaii                868   4963     0   6425
Idaho                 813   4119   126  82677
Illinois            11197   5107   127  55748
Indiana              5313   4458   122  36097
Iowa                 2861   4628   140  55941
Kansas               2280   4669   114  81787
Kentucky             3387   3712    95  39650
Louisiana            3806   3545    12  44930
Maine                1058   3694   161  30920
Maryland             4122   5299   101   9891
Massachusetts        5814   4755   103   7826
Michigan             9111   4751   125  56817
Minnesota            3921   4675   160  79289
Mississippi          2341   3098    50  47296
Missouri             4767   4254   108  68995
Montana               746   4347   155 145587
Nebraska             1544   4508   139  76483
Nevada                590   5149   188 109889
New Hampshire         812   4281   174   9027
New Jersey           7333   5237   115   7521
New Mexico           1144   3601   120 121412
New York            18076   4903    82  47831
North Carolina       5441   3875    80  48798
North Dakota          637   5087   186  69273
Ohio                10735   4561   124  40975
Oklahoma             2715   3983    82  68782
Oregon               2284   4660    44  96184
Pennsylvania        11860   4449   126  44966
Rhode Island          931   4558   127   1049
South Carolina       2816   3635    65  30225
South Dakota          681   4167   172  75955
Tennessee            4173   3821    70  41328
Texas               12237   4188    35 262134
Utah                 1203   4022   137  82096
Vermont               472   3907   168   9267
Virginia             4981   4701    85  39780
Washington           3559   4864    32  66570
West Virginia        1799   3617   100  24070
Wisconsin            4589   4468   149  54464
Wyoming               376   4566   173  97203

Next to the function subset(), we can also use vector indexing (see Chapter 3). The function which() returns the indices for which a logical expression is TRUE. It is in fact always safer to combine the logical statement with which(), since which() effectively treats missing values as FALSE and drops them. Just look at what goes wrong in the following example.

CopyStates <- states
# randomly replace three Income values by NA
CopyStates$Income[sample(1:nrow(CopyStates), 3, FALSE)] <- NA
CopyStates[CopyStates$Income > 5000, ]
             Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
Alaska              365   6315        1.5    69.31   11.3    66.7   152 566432
California        21198   5114        1.1    71.71   10.3    62.6    20 156361
NA                   NA     NA         NA       NA     NA      NA    NA     NA
Illinois          11197   5107        0.9    70.14   10.3    52.6   127  55748
NA.1                 NA     NA         NA       NA     NA      NA    NA     NA
Maryland           4122   5299        0.9    70.22    8.5    52.3   101   9891
Nevada              590   5149        0.5    69.03   11.5    65.2   188 109889
New Jersey         7333   5237        1.1    70.93    5.2    52.5   115   7521
North Dakota        637   5087        0.8    72.78    1.4    50.3   186  69273
NA.2                 NA     NA         NA       NA     NA      NA    NA     NA
CopyStates[which(CopyStates$Income > 5000), ]
             Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
Alaska              365   6315        1.5    69.31   11.3    66.7   152 566432
California        21198   5114        1.1    71.71   10.3    62.6    20 156361
Illinois          11197   5107        0.9    70.14   10.3    52.6   127  55748
Maryland           4122   5299        0.9    70.22    8.5    52.3   101   9891
Nevada              590   5149        0.5    69.03   11.5    65.2   188 109889
New Jersey         7333   5237        1.1    70.93    5.2    52.5   115   7521
North Dakota        637   5087        0.8    72.78    1.4    50.3   186  69273

4.2.2 Find minimum or maximum

A similar function to which() is which.min(), which returns the index of the smallest value in a vector; which.max() works analogously for the largest value. Using the information stored in states, which states in the US have the smallest and the largest population?

least_pop <- which.min(states$Population)
states[least_pop, ]
       Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432
most_pop <- which.max(states$Population)
states[most_pop, ]
           Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
California      21198   5114        1.1    71.71   10.3    62.6    20 156361

Next to these functions, pmin() and pmax() are also incredibly useful: they return the parallel (element-wise) minima and maxima of the input vectors, whereas min() and max() return a single value. Let’s illustrate the difference with a short example.

a <- 1:5
b <- -1 * (1:5)
min(a, b)
[1] -5
pmin(a, b)
[1] -1 -2 -3 -4 -5
max(a, b)
[1] 5
pmax(a, b)
[1] 1 2 3 4 5

4.2.3 Sorting

To sort a vector, you can use the function sort().

sort(states$Population)
 [1]   365   376   472   579   590   637   681   746   812   813   868   931
[13]  1058  1144  1203  1544  1799  2110  2212  2280  2284  2341  2541  2715
[25]  2816  2861  3100  3387  3559  3615  3806  3921  4122  4173  4589  4767
[37]  4931  4981  5313  5441  5814  7333  8277  9111 10735 11197 11860 12237
[49] 18076 21198
sort(states$Population, decreasing = T)
 [1] 21198 18076 12237 11860 11197 10735  9111  8277  7333  5814  5441  5313
[13]  4981  4931  4767  4589  4173  4122  3921  3806  3615  3559  3387  3100
[25]  2861  2816  2715  2541  2341  2284  2280  2212  2110  1799  1544  1203
[37]  1144  1058   931   868   813   812   746   681   637   590   579   472
[49]   376   365

This function, however, is not useful and even dangerous within data frames, since it sorts the values of a single column independently of the other columns. To sort the rows of a data frame, you use the function order(), which returns the ordering indices of the vector. Hence, to sort the states based on their population, you use the following code.

head(sort(states$Population))
[1] 365 376 472 579 590 637
head(order(states$Population))
[1]  2 50 45  8 28 34
sort1.states <- states[order(states$Population), ]
head(sort1.states)
             Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
Alaska              365   6315        1.5    69.31   11.3    66.7   152 566432
Wyoming             376   4566        0.6    70.29    6.9    62.9   173  97203
Vermont             472   3907        0.6    71.64    5.5    57.1   168   9267
Delaware            579   4809        0.9    70.06    6.2    54.6   103   1982
Nevada              590   5149        0.5    69.03   11.5    65.2   188 109889
North Dakota        637   5087        0.8    72.78    1.4    50.3   186  69273
# sort by two variables
sort2.states <- states[order(states$Illiteracy, states$Income), ]
head(sort2.states)
             Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
South Dakota        681   4167        0.5    72.08    1.7    53.3   172  75955
Iowa               2861   4628        0.5    72.56    2.3    59.0   140  55941
Nevada              590   5149        0.5    69.03   11.5    65.2   188 109889
Vermont             472   3907        0.6    71.64    5.5    57.1   168   9267
Utah               1203   4022        0.6    72.90    4.5    67.3   137  82096
Idaho               813   4119        0.6    71.87    5.3    59.5   126  82677

By default, the sort order is increasing (or alphabetical in the case of a character string). You can change this by setting decreasing = TRUE.

# sort in reverse order
sort3.states <- states[order(states$Life.Exp, decreasing = T), ]
head(sort3.states)
             Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
Hawaii              868   4963        1.9    73.60    6.2    61.9     0  6425
Minnesota          3921   4675        0.6    72.96    2.3    57.6   160 79289
Utah               1203   4022        0.6    72.90    4.5    67.3   137 82096
North Dakota        637   5087        0.8    72.78    1.4    50.3   186 69273
Nebraska           1544   4508        0.6    72.60    2.9    59.3   139 76483
Kansas             2280   4669        0.6    72.58    4.5    59.9   114 81787

4.2.4 Merging

To add a column to an existing data frame Df, you can either use Df$NewVar <- 1 or Df[["NewVar"]] <- 1.

Df <- data.frame(id = factor(1:12), group = factor(rep(1:2, each = 3)))
str(Df)
'data.frame':   12 obs. of  2 variables:
 $ id   : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ group: Factor w/ 2 levels "1","2": 1 1 1 2 2 2 1 1 1 2 ...
head(Df)
  id group
1  1     1
2  2     1
3  3     1
4  4     2
5  5     2
6  6     2
x <- rnorm(12)
y <- sample(70:100, 12)
x2 <- rnorm(12)
# add a column
Df$grade <- y  # or Df[["grade"]] <- y
head(Df)
  id group grade
1  1     1    70
2  2     1    75
3  3     1    78
4  4     2    74
5  5     2    80
6  6     2    76

To merge different data frames, you can use the function merge(). For this, we of course need a column in both data frames that uniquely identifies each of the observations. Alternatively, you can use cbind(), but this comes with its own dangers: cbind() simply combines columns by position, so you have to make sure that both data frames are sorted identically.

Df2 <- data.frame(id = Df$id, y)
head(Df2)
  id  y
1  1 70
2  2 75
3  3 78
4  4 74
5  5 80
6  6 76
Df3 <- merge(Df, Df2, by = "id", sort = F) # using merge
head(Df3)
  id group grade  y
1  1     1    70 70
2  2     1    75 75
3  3     1    78 78
4  4     2    74 74
5  5     2    80 80
6  6     2    76 76
Df4 <- cbind(Df, x) # using cbind()
head(Df4)
  id group grade       x
1  1     1    70 -0.9587
2  2     1    75  1.8699
3  3     1    78  0.9086
4  4     2    74 -0.7590
5  5     2    80  1.5899
6  6     2    76  0.5970

To add rows to an existing data frame, you use the function rbind().

# add rows
Df2 <- data.frame(id = factor(13:24), 
                 group = factor(rep(1:2, e = 3)), grade = sample(y))
Df2
   id group grade
1  13     1    90
2  14     1    70
3  15     1    81
4  16     2    78
5  17     2    76
6  18     2    74
7  19     1    80
8  20     1    94
9  21     1    99
10 22     2    88
11 23     2    91
12 24     2    75
Df5 <- rbind(Df, Df2)
Df5
   id group grade
1   1     1    70
2   2     1    75
3   3     1    78
4   4     2    74
5   5     2    80
6   6     2    76
7   7     1    94
8   8     1    81
9   9     1    88
10 10     2    90
11 11     2    99
12 12     2    91
13 13     1    90
14 14     1    70
15 15     1    81
16 16     2    78
17 17     2    76
18 18     2    74
19 19     1    80
20 20     1    94
21 21     1    99
22 22     2    88
23 23     2    91
24 24     2    75
rm(list = ls()[grepl("Df", ls())]) # Clean environment

4.2.5 Aggregate

People experienced with SQL generally want to run an aggregation with a GROUP BY as one of their first tasks in R. aggregate() splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

You will work with diamonds, a data set in the ggplot2 package containing the prices and other attributes of almost 54,000 diamonds. ggplot2 is a package authored and maintained by Hadley Wickham to ‘Create Elegant Data Visualisations Using the Grammar of Graphics’.

library(ggplot2)
head(diamonds)
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
# average price for each type of cut
aggregate(price ~ cut, diamonds, mean)
        cut price
1      Fair  4359
2      Good  3929
3 Very Good  3982
4   Premium  4584
5     Ideal  3458
aggregate(diamonds$price, list(diamonds$cut), mean)
    Group.1    x
1      Fair 4359
2      Good 3929
3 Very Good 3982
4   Premium 4584
5     Ideal 3458
# do a manual check, using `subset()`
s <- subset(diamonds, cut == 'Fair')
mean(s$price)
[1] 4359

Hence, with the above code we tell R to split the data set diamonds into subsets according to the values of the variable cut and to calculate the mean for each of the subsets. Let’s break this code down to get a better grasp of what it does.

1. aggregate(x ~ SplitBy, Df, Function, ...)
   + x ~ SplitBy: We pass a formula which tells R to split the data frame Df according to the different values of the variable SplitBy and to pass the values of the variable x to the function Function.
   + Df: The data frame to use.
   + Function: The function that is computed on each of the subsets.
   + ...: Additional arguments that are passed to Function.
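
To check your understanding of this template, you can swap in another summary function. For instance (a small sketch of our own), length() counts the number of diamonds in each cut subset:

# number of observations (diamonds) per cut
aggregate(price ~ cut, diamonds, length)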

The ellipsis argument ... is incredibly useful to pass additional arguments to the function. This way we are able to adjust the default arguments of the function that is applied to each of the subsets. For example, when we have missing values for the variable price, we can tell R to remove these when computing the mean by setting na.rm = TRUE. The argument na.rm of the function mean() takes a logical value indicating whether NA values should be stripped before the computation proceeds (see ?mean).

# add arguments to the function called
diamonds$price[sample(seq_len(nrow(diamonds)), 3, F)] <- NA
aggregate(price ~ cut, diamonds, mean, na.action = NULL) # na.action is set to NULL for illustration purposes
        cut price
1      Fair  4359
2      Good  3929
3 Very Good    NA
4   Premium    NA
5     Ideal  3458
aggregate(price ~ cut, diamonds, mean, na.rm = TRUE, na.action = NULL)
        cut price
1      Fair  4359
2      Good  3929
3 Very Good  3982
4   Premium  4585
5     Ideal  3458

The function aggregate() is often one of the first more ‘complex’ functions that you will use as a first-time R user. If you are a bit confused, just play around with it a bit and have fun. Load the Pokémon data set from Kaggle (retrieved from https://www.kaggle.com/rounakbanik/pokemon) and experiment with the aggregate function. Remember that it doesn’t always have to be serious (unless you want to be the very best and catch ’em all).

Pokemon <- read.csv("./data/pokemon.csv")
# some capture_rate entries contain extra text after the number;
# keep only the part before the first space, then convert to numeric
Pokemon$capture_rate <- gsub(" .*", "", Pokemon$capture_rate)
Pokemon$capture_rate <- as.numeric(Pokemon$capture_rate)
head(aggregate(capture_rate ~ type1, Pokemon, mean))
     type1 capture_rate
1      bug       119.83
2     dark        84.48
3   dragon        37.33
4 electric       106.97
5    fairy       116.94
6 fighting       103.75

Next to playing around with the code yourself, you also learn a great deal from other people’s code, which is why we have included the following useful illustrations:

s <- aggregate(price ~ cut, diamonds, mean)
s
        cut price
1      Fair  4359
2      Good  3929
3 Very Good  3982
4   Premium  4585
5     Ideal  3458
dd <- merge(diamonds, s, by = "cut", sort = FALSE)
head(dd)
    cut carat color clarity depth table price.x    x    y    z price.y
1 Ideal  0.23     E     SI2  61.5  55.0     326 3.95 3.98 2.43    3458
2 Ideal  1.02     E     SI1  61.1  56.0    4675 6.51 6.45 3.96    3458
3 Ideal  1.05     F     SI2  60.9  56.0    4675 6.64 6.56 4.02    3458
4 Ideal  0.38     I     VS1  61.5  53.9     703 4.66 4.70 2.89    3458
5 Ideal  0.30     E     VS1  62.5  54.0     703 4.27 4.32 2.69    3458
6 Ideal  1.22     J     SI2  61.2  57.0    4676 6.86 6.90 4.21    3458
head(diamonds)
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
head(subset(dd, cut == "Very Good"))
            cut carat color clarity depth table price.x    x    y    z price.y
40249 Very Good  1.01     F     VS2  61.6    57    6773 6.39 6.46 3.96    3982
40250 Very Good  0.30     F    VVS1  61.8    55     783 4.32 4.35 2.68    3982
40251 Very Good  1.15     I     SI2  62.0    58    4405 6.69 6.73 4.16    3982
40252 Very Good  0.90     G     SI2  62.0    59    3445 6.14 6.19 3.82    3982
40253 Very Good  1.00     H    VVS2  62.6    56    6249 6.36 6.39 3.99    3982
40254 Very Good  0.36     E     VS2  62.1    59     789 4.55 4.59 2.84    3982
# change name of the column: merge() appended the suffixes .x and .y
# to the duplicated column name 'price'
names(dd)[names(dd) == 'price.y'] <- 'average price'
# add additional grouping variable
aggregate(price ~ cut + color, diamonds, mean, na.rm = TRUE)
         cut color price
1       Fair     D  4291
2       Good     D  3405
3  Very Good     D  3470
4    Premium     D  3631
5      Ideal     D  2629
6       Fair     E  3682
7       Good     E  3424
8  Very Good     E  3215
9    Premium     E  3540
10     Ideal     E  2598
11      Fair     F  3827
12      Good     F  3496
13 Very Good     F  3778
14   Premium     F  4326
15     Ideal     F  3375
16      Fair     G  4239
17      Good     G  4123
18 Very Good     G  3873
19   Premium     G  4501
20     Ideal     G  3721
21      Fair     H  5136
22      Good     H  4276
23 Very Good     H  4535
24   Premium     H  5217
25     Ideal     H  3889
26      Fair     I  4685
27      Good     I  5079
28 Very Good     I  5256
29   Premium     I  5946
30     Ideal     I  4452
31      Fair     J  4976
32      Good     J  4574
33 Very Good     J  5104
34   Premium     J  6295
35     Ideal     J  4918
# store results in an object
res <- aggregate(price ~ cut + color, diamonds, mean, na.rm = TRUE)
str(res)
'data.frame':   35 obs. of  3 variables:
 $ cut  : Ord.factor w/ 5 levels "Fair"<"Good"<..: 1 2 3 4 5 1 2 3 4 5 ...
 $ color: Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 1 1 1 1 1 2 2 2 2 2 ...
 $ price: num  4291 3405 3470 3631 2629 ...
head(res)
        cut color price
1      Fair     D  4291
2      Good     D  3405
3 Very Good     D  3470
4   Premium     D  3631
5     Ideal     D  2629
6      Fair     E  3682
# aggregate two variables, combine with 'cbind'
aggregate(cbind(price, carat) ~ cut, diamonds, mean)
        cut price  carat
1      Fair  4359 1.0461
2      Good  3929 0.8492
3 Very Good  3982 0.8064
4   Premium  4585 0.8920
5     Ideal  3458 0.7028
aggregate(cbind(price, carat) ~ cut + color, diamonds, mean)
         cut color price  carat
1       Fair     D  4291 0.9201
2       Good     D  3405 0.7445
3  Very Good     D  3470 0.6964
4    Premium     D  3631 0.7215
5      Ideal     D  2629 0.5658
6       Fair     E  3682 0.8566
7       Good     E  3424 0.7451
8  Very Good     E  3215 0.6763
9    Premium     E  3540 0.7179
10     Ideal     E  2598 0.5784
11      Fair     F  3827 0.9047
12      Good     F  3496 0.7759
13 Very Good     F  3778 0.7408
14   Premium     F  4326 0.8271
15     Ideal     F  3375 0.6558
16      Fair     G  4239 1.0238
17      Good     G  4123 0.8509
18 Very Good     G  3873 0.7668
19   Premium     G  4501 0.8415
20     Ideal     G  3721 0.7007
21      Fair     H  5136 1.2192
22      Good     H  4276 0.9147
23 Very Good     H  4535 0.9159
24   Premium     H  5217 1.0164
25     Ideal     H  3889 0.7995
26      Fair     I  4685 1.1981
27      Good     I  5079 1.0572
28 Very Good     I  5256 1.0470
29   Premium     I  5946 1.1449
30     Ideal     I  4452 0.9130
31      Fair     J  4976 1.3412
32      Good     J  4574 1.0995
33 Very Good     J  5104 1.1332
34   Premium     J  6295 1.2931
35     Ideal     J  4918 1.0636
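
As a side note: when the goal is only to attach the group mean as an extra column (as in the merge() illustration above), base R’s ave() achieves this in a single step. A minimal sketch (na.rm = TRUE accounts for the missing prices we introduced earlier):

# compute the per-cut average price and attach it directly as a new column
diamonds$avg_price <- ave(diamonds$price, diamonds$cut,
                          FUN = function(x) mean(x, na.rm = TRUE))
head(diamonds[, c("cut", "price", "avg_price")])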

4.3 Exploratory Data Analysis (EDA)

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others. (Quote from (Grolemund and Wickham 2016))

4.3.1 Exploring a numerical variable

You will work with the CPS1985 data from the AER package that accompanies (Kleiber and Zeileis 2008).

library(AER)
data("CPS1985")
str(CPS1985)
'data.frame':   534 obs. of  11 variables:
 $ wage      : num  5.1 4.95 6.67 4 7.5 ...
 $ education : num  8 9 12 12 12 13 10 12 16 12 ...
 $ experience: num  21 42 1 4 17 9 27 9 11 9 ...
 $ age       : num  35 57 19 22 35 28 43 27 33 27 ...
 $ ethnicity : Factor w/ 3 levels "cauc","hispanic",..: 2 1 1 1 1 1 1 1 1 1 ...
 $ region    : Factor w/ 2 levels "south","other": 2 2 2 2 2 2 1 2 2 2 ...
 $ gender    : Factor w/ 2 levels "male","female": 2 2 1 1 1 1 1 1 1 1 ...
 $ occupation: Factor w/ 6 levels "worker","technical",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sector    : Factor w/ 3 levels "manufacturing",..: 1 1 1 3 3 3 3 3 1 3 ...
 $ union     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
 $ married   : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 1 1 2 1 ...
head(CPS1985) 
      wage education experience age ethnicity region gender occupation
1     5.10         8         21  35  hispanic  other female     worker
1100  4.95         9         42  57      cauc  other female     worker
2     6.67        12          1  19      cauc  other   male     worker
3     4.00        12          4  22      cauc  other   male     worker
4     7.50        12         17  35      cauc  other   male     worker
5    13.07        13          9  28      cauc  other   male     worker
            sector union married
1    manufacturing    no     yes
1100 manufacturing    no     yes
2    manufacturing    no      no
3            other    no      no
4            other    no     yes
5            other   yes      no
summary(CPS1985$wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    5.25    7.78    9.02   11.25   44.50 

Let’s start with the variable wage and quickly illustrate one of the dangers of using attach(). The warning message tells us that, when we refer to the object wage, R will use the one that was created in our global environment, not the column of CPS1985.

wage <- 25
attach(CPS1985) # the warning message already tells you that it will use the object wage that was created before
The following object is masked _by_ .GlobalEnv:

    wage
mean(wage)
[1] 25
mean(CPS1985$wage)
[1] 9.024
detach(CPS1985)
# Use the function searchpaths() to see the 'hierarchy'

So instead of using attach(), just use the $ operator or assign the variable to your global environment. If you don’t want to type the full name of your data frame when using $, you can rename it to something short such as Df, which makes typing Df$Variable a whole lot faster.
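
A third option, not used further in this chapter, is with(), which evaluates an expression with the columns of the data frame in scope, without the masking risks of attach(); a minimal sketch:

# evaluate expressions inside CPS1985 without attaching it
with(CPS1985, mean(wage))
with(CPS1985, tapply(wage, gender, mean))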

# extract the variable from the data frame and store it in the global environment
wage <- CPS1985$wage
summary(wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    5.25    7.78    9.02   11.25   44.50 
is.numeric(wage)
[1] TRUE
mean(wage)
[1] 9.024
median(wage)
[1] 7.78
fivenum(wage)   # Tukey's five number summary
[1]  1.00  5.25  7.78 11.25 44.50
min(wage)
[1] 1
max(wage)
[1] 44.5
var(wage)
[1] 26.41
sd(wage)
[1] 5.139
hist(wage, freq = FALSE)

hist(log(wage),
     freq = FALSE,
     nclass = 20,
     col = "pink")
lines(density(log(wage)), col = 4)

4.3.2 Exploring a categorical (or: factor) variable

occupation <- CPS1985$occupation
str(occupation) # factor variable with 6 levels
 Factor w/ 6 levels "worker","technical",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(occupation)
    worker  technical   services     office      sales management 
       156        105         83         97         38         55 
nlevels(occupation)
[1] 6
levels(occupation)
[1] "worker"     "technical"  "services"   "office"     "sales"     
[6] "management"

To make the output more compact, you will rename levels 2 and 6 of the factor variable occupation.

levels(occupation)[c(2, 6)] <- c("techn", "mgmt")
summary(occupation)
  worker    techn services   office    sales     mgmt 
     156      105       83       97       38       55 

Now you’ll learn how to construct summary tables, barplots and pie charts in R.

tab <- table(occupation)
tab
occupation
  worker    techn services   office    sales     mgmt 
     156      105       83       97       38       55 
prop.table(tab)
occupation
  worker    techn services   office    sales     mgmt 
 0.29213  0.19663  0.15543  0.18165  0.07116  0.10300 
barplot(tab)

pie(tab)

pie(tab, col = gray(seq(0.4, 1.0, length = 6)))
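
A small optional tweak, in case you prefer the bars ordered by frequency: sort the table before plotting.

# order the bars from most to least frequent occupation
barplot(sort(tab, decreasing = TRUE))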

4.3.3 Exploring two categorical (or: factor) variables

gender <- CPS1985$gender
table(gender, occupation)
        occupation
gender   worker techn services office sales mgmt
  male      126    53       34     21    21   34
  female     30    52       49     76    17   21
prop.table(table(gender, occupation))
        occupation
gender    worker   techn services  office   sales    mgmt
  male   0.23596 0.09925  0.06367 0.03933 0.03933 0.06367
  female 0.05618 0.09738  0.09176 0.14232 0.03184 0.03933
prop.table(table(gender, occupation), 2) # margin = 2: proportions within each occupation (column)
        occupation
gender   worker  techn services office  sales   mgmt
  male   0.8077 0.5048   0.4096 0.2165 0.5526 0.6182
  female 0.1923 0.4952   0.5904 0.7835 0.4474 0.3818
# use mosaic plot 
plot(gender ~ occupation, data = CPS1985)

4.3.4 Exploring one numerical and one categorical variable

# here: apply 'mean(.)' to 'wage' by 'gender', then to 'log(wage)' by two factors
tapply(wage, gender, mean)
  male female 
 9.995  7.879 
options(digits=5)
tapply(log(wage), list(gender, occupation), mean)
       worker  techn services office  sales   mgmt
male   2.1004 2.4466   1.8296 1.9553 2.1411 2.4475
female 1.6679 2.3075   1.7017 1.9311 1.5794 2.2293
# let's check these results
# use subset(.) to extract part of the data
s <- subset(CPS1985, select=c(gender, occupation, wage))
s1 <- subset(s, gender == "female" & occupation == "technical")
mean(log(s1$wage))
[1] 2.3075
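
Note that tapply() with a list of grouping factors is essentially the cross-tabulated cousin of aggregate() from the previous section; as a quick cross-check, the same group means in long format:

# the same group means, stacked in long format
head(aggregate(log(wage) ~ gender + occupation, CPS1985, mean))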

Now you’ll build an appropriate visualization.

# see e.g. http://www.r-bloggers.com/box-plot-with-r-tutorial/
boxplot(log(wage) ~ gender)

boxplot(log(wage) ~ gender + occupation, col="light blue")

boxplot(log(wage) ~ gender + occupation, col="light blue", las=2)
# make it a nice graph
.pardefault <- par(no.readonly = TRUE) # store the default settings of par(.)
par(mar = c(12, 5, 4, 2) + 0.1) # widen the bottom margin for the rotated labels
boxplot(log(wage) ~ gender + occupation, col = "light blue", las = 2)

par(.pardefault)

4.4 Exercises

Learning check

  1. Import the data set na.txt that is available in the folder ‘data’ that comes with the book.
  • Use read.table and interpret the resulting data frame.
  • Do you detect any problems (with respect to missing values, strange observations)? Check for missing values using the is.na function applied to a variable from the na data set.
  • If so, try solving those using the arguments of the read.table function [Hint: check the argument na.strings; see also the sketch after this list]. Check again for missing values.
  • Make sure female is a factor variable (with two levels).
  • Count the number of missing values per variable.
  2. (An exercise taken from (Kleiber and Zeileis 2008)) “PARADE” is the Sunday newspaper magazine supplementing the Sunday or weekend edition of some 500 daily newspapers in the United States of America. An important yearly feature is an article providing information on some 120–150 “randomly” selected US citizens, indicating their profession, hometown and state, and their yearly earnings. The Parade2005 data (in library AER) contain the 2005 version, amended by a variable indicating celebrity status (motivated by substantial oversampling of celebrities in these data). For the Parade2005 data:
  • Load the data Parade2005 from the AER package, use data("Parade2005") to make the data accessible.
  • Determine the mean earnings in California.
  • Determine the number of individuals residing in Idaho.
  • Determine the mean and the median earnings of celebrities.
  • Obtain boxplots of log(earnings) stratified by celebrity status.
  • Plot the density of log(earnings), use density.
  3. You will almost always receive a clean data set when you have to use it for an assignment or when it’s used in a course to illustrate something. In real life, however, this will almost never be the case. There will be errors, and it’s important that you are able to handle them, or at least notice them. Garbage in = garbage out. So for this exercise, you have to look for inconsistencies in the data and use the basic data handling steps to solve them. We adjusted the state.x77 data set and you can find it in the data folder as the file States.csv. Load the package datasets, if you haven’t already, and run the command ?state.x77 to obtain information on the variables.
  • Import the csv file. Remember that there are multiple ways to import it and that a lot of things can go wrong (which is the goal of the exercise, we are terribly sorry). It’s completely normal to get the error message Error ... : more columns than column names, since we did this on purpose. You will hate us now, but thank us later. Hint: "\t" is used for tab; see also the sketch after this list.
  • Check if the file was imported correctly. Next to delimiters for the column values, there’s also something called a decimal separator. As you might have expected, we also made sure that you cannot use the default argument here.
  • Inspect the data frame and check if you notice something weird. Which variables contain some weird looking values?
  • Some states claim that Jesus was executed in their state and that he resurrected. Can you find out which states claim to have witnessed this miracle? Use the function pmax to fix this problem.
  • The population estimate is not correct for some states. Set these to NA and perform mean imputation for these cases (i.e. replace them with the mean of the available cases). Do the same for other variables with impossible/improbable values.
  • The variables that had values replaced were Murder, Illiteracy, Population, Life.Exp and Area.
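
To get you started on the import steps of exercises 1 and 3, here is a minimal sketch. The file paths, the na.strings values, and the sep/dec arguments are assumptions; inspect the raw files to verify them.

# exercise 1: tell read.table which strings encode missing values
# (the na.strings shown here are hypothetical placeholders)
na <- read.table("./data/na.txt", header = TRUE,
                 na.strings = c("NA", "-999", "."))
colSums(is.na(na)) # number of missing values per variable

# exercise 3: assumed tab-separated values with a comma as decimal separator
States <- read.table("./data/States.csv", header = TRUE,
                     sep = "\t", dec = ",")
# pmax(x, 0) clips impossible negative values at 0 (column chosen for illustration)
States$Life.Exp <- pmax(States$Life.Exp, 0)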