3 Objects and data types in R

3.1 How it works

You will now start with writing R code in the console and you will explore a first script of R code. Every line of code is interpreted and executed by R. Once R is done computing, the output of your R code will be shown in the console. In some cases, however, something might go wrong (e.g. due to an error in the code) and then you will either get a warning or an error. R makes use of the # sign to add comments, so that you and others can understand what the R code is about. Just like Twitter! Luckily, here your comments are not limited to 280 characters. When passing lines of code preceded by # to the R console, these will simply be ignored and hence, will not influence your results. [Quote from DataCamp’s ‘Introduction to R’ course.]

In its most basic form, R can be used as a simple calculator. We illustrate the use of some arithmetic operators in the code below.

# use 'right click, run line or selection', or Ctrl + R
10^2 + 36
[1] 136
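
As a small addition to the example above, the sketch below lists the other common arithmetic operators. The numbers are arbitrary and chosen for illustration only.

```r
# Basic arithmetic operators in R (illustrative values)
7 + 3        # addition
7 - 3        # subtraction
7 * 3        # multiplication
7 / 3        # division
7^3          # exponentiation, gives 343
7 %% 3       # modulo (remainder after division), gives 1
7 %/% 3      # integer division, gives 2
(7 + 3) * 2  # parentheses control the order of evaluation, gives 20
```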

3.2 Objects

A basic concept in (statistical) programming is called a variable and in R, this is commonly referred to as an object. An object allows you to store a value (e.g. 4) or a more complex data structure (e.g. a database). You can then later use this object’s name to easily access the value or the data structure that is stored within this object. [Quote from DataCamp’s ‘Introduction to R’ course.]

We create an object by giving it a name and using the assignment operator <- or -> to assign a value to this object (Douglas et al. 2020). The value gets stored into the object to which the arrow is pointing. You can then view the value of the object by passing it to the console and the value will then be given as output.

HappyObject <- 1
-1 -> SadObject
HappyObject
[1] 1
SadObject # Don't be so negative
[1] -1

Can you guess what the output will be for the following code?

HappyObject -> SadObject
IAmConfused <- SadObject
IAmConfused
[1] 1

Once we have created an object, we can easily perform some calculations with it.

HappyObject * 5
[1] 5
(HappyObject + 10) / 2
[1] 5.5
SadObject^2
[1] 1

Further, = is an alternative assignment operator to <-, but is often discouraged for people new to R. The <- operator has a higher operator precedence than =, so the two behave differently when they are combined in one expression (for a more detailed explanation see https://stackoverflow.com/questions/1741820/what-are-the-differences-between-and-assignment-operators-in-r). In most contexts, however, = can be used as a safe alternative (Venables, Smith, and R Core Team 2020). Just know that you should use it with care.

a <- b = 2      # throws an error, these 2 operators should not be mixed
mean(b = 5:10)  # throws an error: b is not an argument of mean() and the object b is not created
mean(b <- 5:10) # here, b is created and then considered to be the argument of the function
b

In addition, the code above illustrates that, within a function call, = is used to pass values to the arguments of the function and does not create an object.


3.3 Everything is an object

In R, an analysis is normally broken down into a series of steps. Intermediate results are stored in objects, with minimal output at each step (often none). Instead, the objects are further manipulated to obtain the information required. In fact, the fundamental design principle underlying R (and S) is “everything is an object”. Hence, not only vectors and matrices are objects that can be passed to and returned by functions, but also functions themselves, and even function calls. (Quote from ‘Applied Econometrics with R’, by Kleiber & Zeileis.) A variable in R can take on any available data type, or hold any R object.

# see all objects stored in R's memory, where 'ls()' is for 'List Objects' 
# and returns a vector of character strings
# giving the names of the objects in the specified environment
rm(list = ls()[!grepl("Object|Confused", ls(), perl = TRUE)]) # Clean environment to have a short list
ls()
[1] "HappyObject" "IAmConfused" "SadObject"  
# to remove objects from R's memory, use
rm(SadObject)
ls()
[1] "HappyObject" "IAmConfused"
a <- 1
b <- 2
c <- 3
d <- 4
rm(a, b)
rm(list = c('c', 'd'))
a <- 1
b <- 2
c <- 3
d <- 4
# with the following code, you will remove all objects from your workspace
rm(list = ls())

All objects that you create are stored in your current workspace, and in RStudio you can view the list of objects by clicking on the ‘Environment’ tab in the top right-hand pane. This workspace is also referred to as the global environment and this is where all the interactive computations take place (i.e. outside of a function) (Wickham 2019).

Without going too much into the technical details, we can compare your workspace with your own, special sandbox.

Everything that you create in your sandbox stays there and gets saved in your .RData file when you close your session. When you work in an RStudio project, this .RData file gets automatically loaded (with the default settings) when you open your project again, and with this your session gets ‘restored’, as it contains all objects you created last time. When creating a new project in a different directory, you create a new sandbox, and this makes it easy to structure all of your different projects and analyses.


3.4 Basic data types

R works with numerous data types. Some of the most basic types to get started with are:

  • Decimal values like 4.5 are called numerics.
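  • Natural numbers like 4 are called integers. Integers are also numerics. (When typed in R code, a plain 4 is stored as a numeric; add the suffix L, as in 4L, to store it as an integer.)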
  • Boolean values (TRUE or FALSE) are called logical.
  • Dates or POSIXct for time based variables. Here, Date stores just a date and POSIXct stores a date and time. Both objects are actually represented as the number of days (Date) or seconds (POSIXct) since January 1, 1970.
  • Text (or string) values are called characters.

Note how the quotation marks in the code below indicate that “some text” is a character.

my_numeric <- 42.5

my_character <- "some text"

my_logical <- TRUE

my_date <- as.Date("05/29/2018", "%m/%d/%Y")
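
To make the distinction between integers and numerics, and the way dates are stored, a bit more concrete, here is a minimal sketch. The objects d and p are hypothetical names introduced only for this illustration.

```r
class(4)        # "numeric": a plain 4 is stored as a double
class(4L)       # "integer": the L suffix requests an integer
is.numeric(4L)  # TRUE: integers are also numerics

# A Date is stored as the number of days since January 1, 1970
d <- as.Date("1970-01-10")
as.numeric(d)   # 9

# A POSIXct is stored as the number of seconds since January 1, 1970 (UTC)
p <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(p)   # 60
```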

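Note that the logical values TRUE and FALSE can also be abbreviated as T and F, respectively. Unlike TRUE and FALSE, however, T and F are ordinary variables that can be overwritten, so writing the values in full is safer.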

T
[1] TRUE
F
[1] FALSE

You can check the data type of an object beforehand. You can do this with the class() function.

class(my_numeric)
[1] "numeric"
# your turn to check the type of 'my_character' and 'my_logical' and 'my_date'

When you are interested if an object is of a certain type, you can use the following functions:

is.numeric(my_numeric)
[1] TRUE
is.character(my_numeric)
[1] FALSE
is.character(my_character)
[1] TRUE
is.logical(my_logical)
[1] TRUE

This is incredibly useful when you have to check the input that’s passed to a self-written function and want to prevent objects of the wrong type from being passed. In addition, as you might have noticed, there’s no function is.Date. No need to worry, however, because R’s flexibility allows us to create a function like this ourselves, but we’ll go over it in more detail in Chapter 8. For now, just know that you can alternatively use the function inherits or is:

inherits(my_date, "Date")
[1] TRUE
is(my_date, "Date")
[1] TRUE
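
As a rough sketch of how such type checks can guard a self-written function, consider the code below. The names is_date and days_between are hypothetical helpers made up for this example; they are not part of base R.

```r
# Hypothetical helper: TRUE when x is a Date object
is_date <- function(x) inherits(x, "Date")

# Hypothetical function that refuses input of the wrong type
days_between <- function(from, to) {
  if (!is_date(from) || !is_date(to)) {
    stop("both 'from' and 'to' must be Date objects")
  }
  as.numeric(to - from)  # difference in days as a plain number
}

days_between(as.Date("2018-05-29"), as.Date("2018-06-01"))  # 3
```

Calling days_between("2018-05-29", as.Date("2018-06-01")) stops with the error message, because the first argument is a character string and not a Date.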

3.5 Vectors

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data. In R, you create a vector with the combine function c(). You place the vector elements separated by a comma between the parentheses. (Quote from DataCamp’s ‘Introduction to R course’) Vectors are key! Operations are applied to each element of the vector automatically, so there is no need to loop through the vector.

# To combine elements into a vector, use c():
a = c(1, 2, 3, 4)
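
The element-wise behaviour mentioned above can be sketched as follows. The vector v is a hypothetical name used only here, so that the example does not depend on other objects.

```r
v <- c(1, 2, 3, 4)
v * 2          # each element is doubled: 2 4 6 8
v + v          # element-wise sum: 2 4 6 8
v^2            # element-wise square: 1 4 9 16
sqrt(v)        # functions are applied element by element as well
v + c(10, 20)  # the shorter vector is recycled: 11 22 13 24
```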

Once we have created this vector, we can pass it to functions to gather some useful information about it.

?min
?max
?mean
?sd
?var

min(a)
max(a)
mean(a)
sd(a)
var(a)

In addition to the above functions, length is another function that’s incredibly useful and one of the functions you will use a lot. When passing a vector to this function, it returns the number of elements that it contains.

length(a)
[1] 4

Often, we want to create a vector that’s a sequence of numbers. In this case, we can use the : symbol to create a sequence of values in steps of one (Douglas et al. 2020). Alternatively, we can use the function seq which allows for more flexibility.

# steps of one
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
seq_len(10)
 [1]  1  2  3  4  5  6  7  8  9 10
# specify the steps yourself
seq(from = 0, to = 10, by = 0.5)
 [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0
[16]  7.5  8.0  8.5  9.0  9.5 10.0
# or the length of the vector, and the steps will be computed by R
seq(from = 0, to = 10, length.out = 6)
[1]  0  2  4  6  8 10

When we need to repeat certain values, we can use the rep function.

rep(1, times = 5)
[1] 1 1 1 1 1
rep(1:5, times = 5)
 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(1:5, each = 2)
 [1] 1 1 2 2 3 3 4 4 5 5

3.5.1 Vector indexing

To access certain elements of a vector, we use the square brackets []. For example,

Abra = 1:5
Abra[1]
[1] 1
Abra[5]
[1] 5

To select a subset of elements, we can specify an index vector (Venables, Smith, and R Core Team 2020) that specifies which elements should be selected and in which order.

Abra[c(2, 4)]
[1] 2 4
Abra[c(4, 2)]
[1] 4 2

The index vector can be of four different types (Venables, Smith, and R Core Team 2020):

  1. A logical vector.
Abra[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
[1] 1 3 4
Kadabra <- Abra > 3
Kadabra
[1] FALSE FALSE FALSE  TRUE  TRUE
Abra[Kadabra]
[1] 4 5
  2. A vector with positive indices, which specifies which elements should be selected.
Abra[1:3]
[1] 1 2 3
  3. A vector with negative indices, which specifies which elements should be excluded.
Abra[-c(1:3)]
[1] 4 5
  4. A vector of character strings, in case of a named vector. This is then similar to the index vector with positive indices, but now we select the items based on their names. This will be particularly useful later on, when we are working with data frames.
a <- 1:3
names(a) <- c("Squirtle", "Bulbasaur", "Charmander")
a
  Squirtle  Bulbasaur Charmander 
         1          2          3 
a["Squirtle"]
Squirtle 
       1 
# or
IChooseyou <- c("Charmander")
a[IChooseyou]
Charmander 
         3 

Besides selecting elements, we can also use indexing to modify only the selected elements.

a[1] = 25
a
  Squirtle  Bulbasaur Charmander 
        25          2          3 
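
The same idea works with any type of index vector. The sketch below uses a hypothetical vector scores with made-up values.

```r
scores <- c(4, 12, 7, 15)

# logical index: cap every value above 10 at 10
scores[scores > 10] <- 10
scores                        # 4 10 7 10

# positive index: double the first and third element only
scores[c(1, 3)] <- scores[c(1, 3)] * 2
scores                        # 8 10 14 10
```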

3.5.2 Character and logical vectors

A vector can either hold numeric, character or logical values.

family <- c("Katrien", "Jan", "Leen")
family
[1] "Katrien" "Jan"     "Leen"   
family[2]
[1] "Jan"
str(family) # str() displays the structure of an R object in a compact way
 chr [1:3] "Katrien" "Jan" "Leen"
class(family)
[1] "character"

In addition, you can give a name to the elements of a vector with the names() function. Here is how it works:

my_vector <- c("Katrien Antonio", "teacher")
names(my_vector) <- c("Name", "Profession")
my_vector
             Name        Profession 
"Katrien Antonio"         "teacher" 

Important to remember is that a vector can only hold elements of the same type. Consequently, when you combine elements of different types in a vector, R coerces all of them to the most flexible type present (logical < numeric < character).

c(0, TRUE)
[1] 0 1
c(0, "Character")
[1] "0"         "Character"
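
A few more combinations illustrate this ordering of types. This is only a sketch with arbitrary values.

```r
class(c(TRUE, 2L))         # "integer": the logical becomes 1L
class(c(2L, 3.5))          # "numeric": the integer becomes a double
class(c(3.5, "a"))         # "character": the number becomes "3.5"

# coercion of logicals to numbers is handy for counting
sum(c(TRUE, FALSE, TRUE))  # 2

# explicit conversion in the other direction
as.numeric("12.5")         # 12.5
```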

3.5.3 Missing values

When working with real-life data, you are confronted with missing data more often than you’d care to admit. Missing values are indicated by NA, and most operations involving an NA return NA as well. To assess which elements are missing in a vector, you can use the function is.na.

a <- c(1:2, NA, 4:5)
a
[1]  1  2 NA  4  5
is.na(a)
[1] FALSE FALSE  TRUE FALSE FALSE

As it returns a logical vector, we can use it as an index vector.

a[is.na(a)]
[1] NA
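
The sketch below shows how NA propagates through calculations and how the na.rm argument of functions such as sum and mean can be used to ignore missing values. The vector v is a hypothetical name used only in this example.

```r
v <- c(1, 2, NA, 4, 5)

sum(v)                 # NA: the missing value propagates
sum(v, na.rm = TRUE)   # 12: missing values are dropped first
mean(v, na.rm = TRUE)  # 3
sum(is.na(v))          # 1: number of missing values

NA > 1                 # NA: comparisons with NA are NA as well
NA == NA               # NA, which is why is.na() is needed
```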

3.5.4 Logical operators

We are able to create logical expressions using the comparison operators <, <=, >, >=, == and !=, where == tests for exact equality and != for inequality. This enables us to select a subset of elements. Further, we can combine logical expressions using & or | to denote their intersection or union, respectively.

a <- 1:5
a > 3
[1] FALSE FALSE FALSE  TRUE  TRUE
a == 3
[1] FALSE FALSE  TRUE FALSE FALSE
a[a > 2 & a < 4]
[1] 3
a[a == 3 | a == 5]
[1] 3 5
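
A few related base R helpers are often used together with these operators. The following is a short sketch on a hypothetical vector v.

```r
v <- 1:5

v != 3           # TRUE TRUE FALSE TRUE TRUE
v %in% c(3, 5)   # FALSE FALSE TRUE FALSE TRUE, shorthand for v == 3 | v == 5
which(v > 3)     # positions of the TRUE values: 4 5
any(v > 4)       # TRUE: at least one element is larger than 4
all(v > 0)       # TRUE: every element is larger than 0
```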

To get the negation of a logical expression, we make use of the ! operator.

FALSE
[1] FALSE
!FALSE
[1] TRUE
b <- c(TRUE, FALSE, TRUE, TRUE)
!b
[1] FALSE  TRUE FALSE FALSE

This ! operator can then be used for a whole range of useful manipulations. Going back to the vector with missing values, we can use this to exclude the missing values in the vector.

a = c(1:2, NA, 4:5)
a[!is.na(a)]
[1] 1 2 4 5
na.omit(a) # alternative to omit missing values
[1] 1 2 4 5
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"

The above also illustrates that we can combine multiple statements or manipulations in one line of code. Combining them gives us a very powerful tool to handle and analyze data in an efficient way.

a <- -5:5
max(a[a > 0 & a <= 3])
[1] 3

3.5.5 Factors

To specify that you have a vector with a discrete classification, we make use of a factor object which can either be ordered or unordered. These are mainly used in formulae, but we will already introduce the basics here.

Fruits <- c("Apple", "Banana", "Grape", "Lemons")
Fruits <- factor(Fruits)
Var    <- rep(1:4, each = 2)
Var    <- factor(Var, levels = 1:4, labels = c("Apple", "Banana", "Grape", "Lemons"))
Var
[1] Apple  Apple  Banana Banana Grape  Grape  Lemons Lemons
Levels: Apple Banana Grape Lemons
levels(Var)
[1] "Apple"  "Banana" "Grape"  "Lemons"
nlevels(Var)
[1] 4
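
The example above creates an unordered factor. As a minimal sketch of the ordered variant, the code below uses made-up data; the object sizes and the order of its levels are assumptions for this illustration.

```r
# Illustrative data; the level order is chosen for this example
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
sizes
is.ordered(sizes)  # TRUE
sizes < "large"    # comparisons follow the level order: TRUE FALSE TRUE TRUE
table(sizes)       # counts per level, shown in level order
```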

Be careful, however, when converting factor variables to numeric. Each level of a factor is stored as an underlying integer code, and as.numeric returns these codes rather than the labels that are displayed.

as.numeric(Var)
[1] 1 1 2 2 3 3 4 4
a <- as.character(c(3, 5, 29, 5))
a <- factor(a)
a
[1] 3  5  29 5 
Levels: 29 3 5
as.numeric(a)
[1] 2 3 1 3
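
To recover the numbers that are displayed, rather than the underlying codes, one common approach is to go through the labels first. A sketch, using a hypothetical factor f with the same values as above:

```r
f <- factor(c("3", "5", "29", "5"))

as.numeric(f)                # 2 3 1 3: the underlying codes
as.numeric(as.character(f))  # 3 5 29 5: convert the labels instead
as.numeric(levels(f))[f]     # 3 5 29 5: same result, converts each level only once
```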

3.6 Matrices

In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional. You can construct a matrix in R with the matrix() function. (Quote from DataCamp’s ‘Introduction to R course’)

# a 3x4 matrix, filled with 1,2,..., 12
matrix(1:12, 3, 4, byrow = TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
matrix(1:12, byrow = TRUE, nrow = 3)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
# hmmm, check help on 'matrix'
? matrix

In addition to the function matrix, we can also create matrices by combining vectors through use of the cbind and rbind functions.

# one way of creating matrices is to bind vectors together
cbind(1:2, 6:9)     # by columns
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    1    8
[4,]    2    9
rbind(1:3, -(1:3))  # by rows
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   -1   -2   -3
m <- cbind(a = 1:3, b = letters[1:3])
m
     a   b  
[1,] "1" "a"
[2,] "2" "b"
[3,] "3" "c"
rbind(a = 1:3, b = letters[1:3])
  [,1] [,2] [,3]
a "1"  "2"  "3" 
b "a"  "b"  "c" 
# ask help, what is the built-in 'letters'?
? letters

3.6.1 Matrix operations and indexing

Matrices and their theory are an essential part of linear algebra and R therefore has a lot of functions specifically designed for matrices.

# create matrix object 'x'
x <- matrix(1:12, 3, 4)
x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
nrow(x)
[1] 3
ncol(x)
[1] 4
dim(x)
[1] 3 4
t(x)    # matrix transpose
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
x = matrix(1:4, nrow = 2)
x %o% x          # outer product
, , 1, 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2, 1

     [,1] [,2]
[1,]    2    6
[2,]    4    8

, , 1, 2

     [,1] [,2]
[1,]    3    9
[2,]    6   12

, , 2, 2

     [,1] [,2]
[1,]    4   12
[2,]    8   16
outer(x, x, "*") # alternative
, , 1, 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2, 1

     [,1] [,2]
[1,]    2    6
[2,]    4    8

, , 1, 2

     [,1] [,2]
[1,]    3    9
[2,]    6   12

, , 2, 2

     [,1] [,2]
[1,]    4   12
[2,]    8   16
diag(x)          # extract diagonal elements
[1] 1 4
det(x)           # determinant
[1] -2
eigen(x)         # eigenvalues and eigenvectors
eigen() decomposition
$values
[1]  5.3723 -0.3723

$vectors
        [,1]    [,2]
[1,] -0.5658 -0.9094
[2,] -0.8246  0.4160

An important difference with other statistical software programs is that * is used for element-wise multiplication. When you want to multiply matrices, you should use the %*% operator.

x * x        # element-wise multiplication
     [,1] [,2]
[1,]    1    9
[2,]    4   16
t(x) %*% x   # use %*% for matrix multiplication
     [,1] [,2]
[1,]    5   11
[2,]   11   25
crossprod(x) # alternative to t(x) %*% x
     [,1] [,2]
[1,]    5   11
[2,]   11   25
x %*% t(x)
     [,1] [,2]
[1,]   10   14
[2,]   14   20
tcrossprod(x)
     [,1] [,2]
[1,]   10   14
[2,]   14   20

Further, to get the inverse of a matrix, we use the solve function.

solve(x)
     [,1] [,2]
[1,]   -2  1.5
[2,]    1 -0.5
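
As a quick check, multiplying a matrix with its inverse should return the identity matrix. The sketch below also shows that solve can be used to solve a linear system directly. The names A, A_inv, b and sol are hypothetical and used only in this example.

```r
A <- matrix(1:4, nrow = 2)
A_inv <- solve(A)
A %*% A_inv          # the 2x2 identity matrix (up to rounding error)

# solve the linear system A %*% sol = b without computing the inverse explicitly
b <- c(5, 6)
sol <- solve(A, b)
sol                  # -1 2
A %*% sol            # reproduces b
```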

To select a subset of elements of a matrix, we again use vector indices within the square brackets []. When we only want to select certain rows, columns or both, we put a comma in the square brackets.

x[1:5]  # select the first 5 elements in column-major order (down the 1st column, then the 2nd); x has only 4 elements, so the 5th is NA
[1]  1  2  3  4 NA
x[1, ]  # select first row
[1] 1 3
x[, 1]  # select first column
[1] 1 2
x[2, 2] # select the element in the second row and second column
[1] 4
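
On a somewhat larger matrix, the different ways of indexing are easier to tell apart. The following is a sketch with a hypothetical 3x4 matrix m.

```r
m <- matrix(1:12, nrow = 3)  # filled column by column

m[2, 3]               # single element in row 2, column 3: 8
m[, 1]                # first column, returned as a plain vector: 1 2 3
m[, 1, drop = FALSE]  # first column, kept as a 3x1 matrix
m[m[, 1] > 1, ]       # logical index: rows where the first column exceeds 1
m[-1, c(2, 4)]        # all rows except the first, columns 2 and 4
```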

3.7 Lists

A list in R allows you to gather a variety of objects under one object in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way. You could say that a list is some kind of super data type: you can store practically any piece of information in it! (Quote from DataCamp’s ‘Introduction to R course’)

# a first example of a list
L <- list(one = 1, two = c(1, 2), five = seq(1, 4, length=5),
          six = c("Katrien", "Jan"))
names(L)
[1] "one"  "two"  "five" "six" 
summary(L)
     Length Class  Mode     
one  1      -none- numeric  
two  2      -none- numeric  
five 5      -none- numeric  
six  2      -none- character
class(L)
[1] "list"
str(L)
List of 4
 $ one : num 1
 $ two : num [1:2] 1 2
 $ five: num [1:5] 1 1.75 2.5 3.25 4
 $ six : chr [1:2] "Katrien" "Jan"
# list within a list
# a list containing: a sample from a N(0,1), plus some markup
mylist <- list(sample = rnorm(5), family = "normal distribution", parameters = list(mean = 0, sd = 1))
mylist
$sample
[1]  1.1305  1.7639 -0.8482 -0.6672 -1.4533

$family
[1] "normal distribution"

$parameters
$parameters$mean
[1] 0

$parameters$sd
[1] 1
str(mylist)
List of 3
 $ sample    : num [1:5] 1.13 1.764 -0.848 -0.667 -1.453
 $ family    : chr "normal distribution"
 $ parameters:List of 2
  ..$ mean: num 0
  ..$ sd  : num 1

The objects stored on the list are known as its components (Venables, Smith, and R Core Team 2020) and to access these components, we either use a numerical value indicating the position in the list or the name of the component (only possible when it has been given a name of course).

# now check
mylist[[1]]
[1]  1.1305  1.7639 -0.8482 -0.6672 -1.4533
mylist[["sample"]]
[1]  1.1305  1.7639 -0.8482 -0.6672 -1.4533

If the components have names, we can also access them using the $ operator in the following way.

mylist$sample
[1]  1.1305  1.7639 -0.8482 -0.6672 -1.4533
mylist$parameters
$mean
[1] 0

$sd
[1] 1
mylist$parameters$mean
[1] 0

Moreover, we can even access the elements of the component in the same way as we did before.

mylist[[1]][2:4]
[1]  1.7639 -0.8482 -0.6672

To access lists within lists, we use the following code:

Dream = list(WithinADream = list(WithinAnotherDream = "DieTraumdeutung"))
Dream$WithinADream$WithinAnotherDream
[1] "DieTraumdeutung"
Dream[[1]][[1]]
[1] "DieTraumdeutung"

We use double square brackets to get the component in its original form. If we just use single brackets, we get it as an object of class list.

Dream = list(WithinADream = "SomethingFunny")
class(Dream[[1]])
[1] "character"
class(Dream[1])
[1] "list"
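
The same operators can be used to add, change or remove components. A minimal sketch, using a hypothetical list person with made-up values:

```r
person <- list(name = "Katrien", scores = c(8, 9))

person$city <- "Leuven"                           # add a new component (made-up value)
person[["scores"]] <- c(person[["scores"]], 10)   # change an existing component
person$name <- NULL                               # assigning NULL removes a component

names(person)                # "scores" "city"
length(person[["scores"]])   # 3
```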

3.8 Data frames

Most data sets you will be working with will be stored as data frames. A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.

First, you will look at a ‘classic’ data set from the datasets package that comes with the base R installation. The mtcars (Motor Trend Car Road Tests) data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). (Quote from DataCamp’s ‘Introduction to R course’)

mtcars
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

Since using built-in data sets is not even half the fun of creating your own data sets, you will now work with your own personally created data set. (Quote from DataCamp’s ‘Introduction to R course’)

Df <- data.frame(x = c(11, 12, 7), y = c(19, 20, 21), z = c(10, 9, 7))
# quick scan of the object 'Df'
summary(Df)
       x              y              z        
 Min.   : 7.0   Min.   :19.0   Min.   : 7.00  
 1st Qu.: 9.0   1st Qu.:19.5   1st Qu.: 8.00  
 Median :11.0   Median :20.0   Median : 9.00  
 Mean   :10.0   Mean   :20.0   Mean   : 8.67  
 3rd Qu.:11.5   3rd Qu.:20.5   3rd Qu.: 9.50  
 Max.   :12.0   Max.   :21.0   Max.   :10.00  
str(Df)
'data.frame':   3 obs. of  3 variables:
 $ x: num  11 12 7
 $ y: num  19 20 21
 $ z: num  10 9 7
# another way to create the same data frame
x <- c(11, 12, 7)
y <- c(19, 20, 21)
z <- c(10, 9, 7)
Df <- data.frame(x, y, z)

Accessing elements in a data frame is similar to how we access elements in a matrix. We can again use an index vector to access either the rows, columns or both. In addition, similar to lists, we can access columns using the $ operator or using the double square brackets.

Df[1:2, ]
   x  y  z
1 11 19 10
2 12 20  9
Df[, 2:3]
   y  z
1 19 10
2 20  9
3 21  7
Df$x
[1] 11 12  7
Df[["x"]]
[1] 11 12  7
Df[[1]]
[1] 11 12  7
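
Logical index vectors work for data frames too, which gives a simple way to filter rows. The sketch below rebuilds the same data under the hypothetical name df so that it can be run on its own.

```r
df <- data.frame(x = c(11, 12, 7), y = c(19, 20, 21), z = c(10, 9, 7))

df[df$x > 10, ]                       # rows where x is larger than 10 (rows 1 and 2)
df[df$x > 10, c("x", "z")]            # same rows, only columns x and z
df$total <- df$x + df$y + df$z        # add a new column: 40 41 35
subset(df, z < 10, select = c(x, y))  # base R alternative for filtering and selecting
```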

In essence, a data frame can be seen as a combination of a list and a matrix. The variables are its components and the object has a separate class "data.frame" (Venables, Smith, and R Core Team 2020).

is.list(Df)
[1] TRUE
class(Df)
[1] "data.frame"

But that’s enough technical stuff for now, let’s do our first data exploration and calculate the mean of the variable z in data frame Df!

mean(Df$z)   
[1] 8.667
mean(z)   # only works because a separate vector z was created above
[1] 8.667

The code mean(z) only works here because we created a separate vector z in the global environment when we built Df the second time. Had z been defined only within the data frame, as in the first data.frame() call above, mean(z) would throw an error. Going back to the sandbox analogy, you can look at the data frame as a mini-sandbox within your bigger sandbox. Everything that gets defined in this sandbox, stays there. This way, we keep our sandbox nice and organized. Just imagine the mess when all of the variables of your data frame would just float around in your sandbox.

One ‘dirty’ way to access the variables in your data frame without specifying the said data frame, is to use the attach function. With this function, we tell R that it also has to search within the attached data frame.

rm(x, y, z) # Remove variables
attach(Df)
mean(z)
[1] 8.667
detach(Df)

Using attach, however, can be dangerous. If you created an object with the same name as a variable in your data frame, R will not use the variable in your data frame but the object in your global environment.

x = rnorm(1e2)
z = "KadabraCastsConfusion"
attach(Df)
The following objects are masked _by_ .GlobalEnv:

    x, z
mean(x)
[1] 0.002858
mean(Df$x)
[1] 10
mean(z)
Warning in mean.default(z): argument is not numeric or logical: returning NA
[1] NA
detach(Df)

One way to avoid this, is to use the function with.

with(Df, mean(z))
[1] 8.667

More on data frames

# this does not work
# Df <- data.frame(x = c(11,12), y = c(19,20,21), z = c(10,9,7)) 
# but you _can_ do
Df <- data.frame(x = c(11, 12, NA), y = c(19, 20, 21), z = c(10, 9, 7))
# data frame with different types of information
b <- data.frame(x = c(11, 12, NA), y = c("me", "you", "everyone"))
str(b)
'data.frame':   3 obs. of  2 variables:
 $ x: num  11 12 NA
 $ y: chr  "me" "you" "everyone"

In versions of R before 4.0.0, character variables in a data frame were automatically converted to factor variables. Factors were briefly mentioned before and, in essence, factor variables are used to store categorical variables (i.e. nominal, ordinal or dichotomous variables). Categorical variables can only take on a limited number of values. Conversely, continuous variables can take on an uncountable set of values. If you want R to convert the variables with character strings to factor variables when creating a data frame, just specify stringsAsFactors = TRUE.

b <- data.frame(x = c(11, 12, NA), y = c("me", "you", "everyone"), stringsAsFactors = TRUE)
str(b)
'data.frame':   3 obs. of  2 variables:
 $ x: num  11 12 NA
 $ y: Factor w/ 3 levels "everyone","me",..: 2 3 1

3.9 Exercises

Learning check

  1. Explore the objects and data types in R.
  • Create a vector fav_music with the names of your favorite artists.
  • Create a vector num_records with the number of records you have in your collection of each of those artists.
  • Create a vector num_concerts with the number of times you attended a concert of these artists.
  • Put everything together in a data frame, assign the name my_music to this data frame and change the labels of the information stored in the columns to artist, records and concerts.
  • Extract the variable records (the former num_records) from the data frame my_music. Calculate the total number of records in your collection (for the defined set of artists). Check the structure of the data frame, ask for a summary.