8 Writing functions in R
8.1 Conditionals and control flow
In this chapter you’ll learn about relational operators to compare R objects and logical operators to combine logical expressions. Next, you’ll use this knowledge to build conditional statements. [Quote from DataCamp’s `Intermediate R’ course]
Make sure not to mix up ==
and =
, where the latter is used for assignment and the former checks equality (see Section 3.2).
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
Now you’ll focus on inequalities.
[1] FALSE
[1] FALSE
[1] FALSE
For string comparison, R determines the greater than relationship based on alphabetical order. Also, keep in mind that TRUE
corresponds to 1 in R, and FALSE
coerces to 0 behind the scenes.
R’s relational operators also work on vectors.
[1] TRUE TRUE FALSE FALSE TRUE
[1] FALSE FALSE FALSE TRUE TRUE
8.2 Logical operators
We already discussed the logical operators in Chapter 3, so you already know the basics. Can you predict what the outcome will be of the following statements?
The logical operators applied to vectors
[1] FALSE FALSE FALSE FALSE TRUE
8.3 Conditional statements
We encounter conditional statements on a daily basis. Just think of the following behavior that perfectly illustrates this.
Hahahaha =D
*Cough* awkward silence...
With this, we illustrated the basic concept of a conditional statement and with it, a conditional execution. We basically read this code as if
this statement is TRUE
then we do this. If it is FALSE
, we do another thing. With the above example, it is quite obvious. If the joke is funny, we laugh. If not, an awkward silence follows.
We can also just use the if
without specifying else
(make sure your volume is not at max when running this code and wearing a headphone).
We can also use the function ifelse
, which is a vectorized version of the if
/else
construct (Douglas et al. 2020).
FunnyJoke <- c(TRUE, FALSE)
CondTRUE <- "Hahaha =D"
CondFALSE <- "*Cough* awkward silence..."
ifelse(FunnyJoke, CondTRUE, CondFALSE)
[1] "Hahaha =D" "*Cough* awkward silence..."
When using the ifelse()
function, you have to make sure that you know what it does. ifelse(test, yes, no)
returns a vector of the same length as test
, with elements yes[i]
when test[i]
is TRUE
and no[i]
when test[i]
is FALSE
. This might sound confusing, so let’s just illustrate it with an example.
[1] "< 0" ">= 0" ">= 0"
We can also use multiple nested if
… else
statements.
ShellExec <- function(x) {
# replacement for shell.exe (doesn't exist on MAC)
if (exists("shell.exec",where = "package:base"))
return(base::shell.exec(x))
comm <- paste("open",x)
return(system(comm))
}
Samples <- 1:3
SampleVideo <- sample(Samples, 1, FALSE)
OpenVid <-
if(SampleVideo == 1) {
"https://www.youtube.com/watch?v=dQw4w9WgXcQ"
} else if(SampleVideo == 2) {
"https://www.youtube.com/watch?v=CsGYh8AacgY&t=12s"
} else {
"https://www.youtube.com/watch?v=1g9sneS2MF4"
}
ShellExec(OpenVid)
SampleVideo <- sample(Samples[!1:3 %in% SampleVideo], 1, FALSE)
8.4 Loops
When we have to perform a repetitive task, loops are the way to go. This repetitive task is then executed within each loop for a specified number of times or until a specific condition is met. Within R, there are three main types of loops, the for
loop, the while
loop and the repeat
loop @cite(introRbookdown).
8.4.1 The for loop
The for
loop is the most well known loop and also one of the most common loops within programming languages. The following example illustrates how for
loops work.
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In this example, we loop over the vector 1:5
and i
is used as an index or counter. The inner part of the for
loop is then repeated for every iteration. Breaking it down to the basics, we tell R to repeat the inner part (i.e. the code within the curly braces {}
) 5 times (length(1:5)
), taking subsequent values of the vector 1:5
in every loop and to keep track of the iteration with i
.
Hence, for the first iteration, i
gets the first value of the vector 1:5
which is 1
and runs the code print(i)
. In the second iteration, it gets the value 2
and so on.
The counter is not limited to i
, but basically every valid name can be used. Can you guess what the following output will be?
[1] 2
[1] 3
[1] 5
[1] 7
[1] 11
[1] 13
When working with loops, we have to be careful to not get stuck in an infinite loop. Even though this is more common in while
loops, this can also happen in for
loops.
For example, when we adjust the counter i
in the inner part, we get stuck in an infinite loop. One way to get out of this loop, is by using the break
statement.
[1] 1
[1] 1
[1] 1
Consequently, when using a for loop, make sure that you don’t adjust the counter variable i
that keeps track of the iterations! To make sure that everything goes as expected in the inner part of your for
loop, you can always run a test by first creating i <- 1
(or whatever value of the vector you are looping over) for example and then run the inner part.
To save the results of each iteration, we can basically use any data structure. A list is most often used, as its structure is quite intuitive to save the results of our loop. To save our results in this way, we create an empty list. Let’s work with the data base birthwt
from the package MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
bwt <- with(birthwt, {
race <- factor(race, labels = c("white", "black", "other"))
ptd <- factor(ptl > 0)
ftv <- factor(ftv)
levels(ftv)[-(1:2)] <- "2+"
data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})
Cols <- colnames(bwt)
DescrVars <- list()
for(Var in Cols) {
x <- bwt[, Var]
DescrVars[[Var]] <-
if(is.numeric(x)) {
mean(x)
} else {
table(x)
}
}
DescrVars
$low
x
0 1
130 59
$age
[1] 23.238
$lwt
[1] 129.81
$race
x
white black other
96 26 67
$smoke
x
FALSE TRUE
115 74
$ptd
x
FALSE TRUE
159 30
$ht
x
FALSE TRUE
177 12
$ui
x
FALSE TRUE
161 28
$ftv
x
0 1 2+
100 47 42
When using loops, it is also important to try to be as efficiently as possible by avoiding repetition within the loop. Try to avoid recomputing the same thing within the same loop. This is why we use x <- bwt[, Var]
in the above example and not
DescrVars <- list()
for(Var in Cols) {
DescrVars[[Var]] <-
if(is.numeric(bwt[, Var])) {
mean(bwt[, Var])
} else {
table(bwt[, Var])
}
}
By avoiding repetition, the code will run faster and this will be more important when working with large inner functions or large data sets. In addition, the chance of a coding error is also smaller in this case.
8.4.1.1 Nested for loops
We can also use a for
loop within another for
loop and that’s what we called nested for
loops. For example,
Alph = matrix(letters[1:4], nrow = 2)
for(j in 1:ncol(Alph)) {
for(i in 1:nrow(Alph)) {
cat(paste0("\nAlph[",i, ", ", j , "] = ", Alph[i, j], "\n"))
}
}
Alph[1, 1] = a
Alph[2, 1] = b
Alph[1, 2] = c
Alph[2, 2] = d
To save the results of nested loops, lists are the perfect type of object. In order to be able to save the results in an empty list (which does not have the nested structure yet), we have to use names when saving these in the slots of the list.
Results <- list()
for(j in seq_len(ncol(Alph))) {
for(i in seq_len(nrow(Alph))) {
Results[[paste0("Column", j)]][[paste0("Row", i)]] <- paste0("Alph[",i, ", ", j , "] = ", Alph[i, j])
}
}
Results
$Column1
$Column1$Row1
[1] "Alph[1, 1] = a"
$Column1$Row2
[1] "Alph[2, 1] = b"
$Column2
$Column2$Row1
[1] "Alph[1, 2] = c"
$Column2$Row2
[1] "Alph[2, 2] = d"
Alternatively, we can already specify the dimensions of the nested list using the following code.
Results <- list(vector("list", 2), vector("list", 2)) # Create an empty nested list
for(j in seq_len(ncol(Alph))) {
for(i in seq_len(nrow(Alph))) {
Results[[j]][[i]] <- paste0("Alph[",i, ", ", j , "] = ", Alph[i, j])
}
}
Results
[[1]]
[[1]][[1]]
[1] "Alph[1, 1] = a"
[[1]][[2]]
[1] "Alph[2, 1] = b"
[[2]]
[[2]][[1]]
[1] "Alph[1, 2] = c"
[[2]][[2]]
[1] "Alph[2, 2] = d"
8.4.2 The while loop
With a while
loop, you repeat the inner part until some logical condition is met.
Take, for example, that we have 64 tasks on our to-do list and we want to get at least 5 tasks done for today.
10 tasks left
9 tasks left
8 tasks left
7 tasks left
6 tasks left
[1] 5
Another example is the following.
[1] FALSE
As with for
loops, we can also use the break
statement to stop the loop.
[1] 8
The example below puts many things together (taken from DataCamp’s `Intermediate R’ course).
[1] 3
[1] 6
[1] 9
[1] 12
[1] 15
[1] 18
[1] 21
[1] 24
The repeat
loop is similar to the while
loop, but has no conditional check (Douglas et al. 2020). So in this case, you have to make sure that you don’t get stuck in an infinite loop by building in a break
statement.
i <- 0
repeat {
print("Eat")
print("Sleep")
print("Rave")
print("Repeat")
i <- i + 1
if(i >= 3)
break
}
[1] "Eat"
[1] "Sleep"
[1] "Rave"
[1] "Repeat"
[1] "Eat"
[1] "Sleep"
[1] "Rave"
[1] "Repeat"
[1] "Eat"
[1] "Sleep"
[1] "Rave"
[1] "Repeat"
8.5 Functions in R
The book ‘An Introduction to R’ (Douglas et al. 2020) gives a splendid (and funny) description of functions:
Functions are your loyal servants, waiting patiently to do your bidding to the best of their ability. They’re made with the utmost care and attention … though sometimes may end up being something of a Frankenstein’s monster - with an extra limb or two and a head put on backwards. But no matter how ugly they may be, they’re completely faithful to you.
They’re also very stupid.
8.5.1 Using a function
Throughout this book, we have already been using functions, but what exactly is a function? Let’s start from a classic one, the mean()
function.
function (x, ...)
NULL
Basically, we give input to this function and it will return the arithmetic mean. In mean(x, trim = 0, na.rm = FALSE, ...)
x
is required; if you do not specify it, R will throw an error. This is a good thing, because you of course need to tell R of what object you want to compute the mean. Here, x
is an argument of the function mean()
and with mean(x = a)
, you tell R to compute the mean of the object a
. trim
and na.rm
are optional arguments: they have a default value which is used if the arguments are not explicitly specified. [Quote from DataCamp’s ‘Intermediate R’ course.]
You will now use the mean
function as follows
[1] NA
[1] 6.25
Functions return objects that can be used elsewhere. As such, you can use a function within function.
[1] 2 6 4 NA NA
[1] 4
8.5.2 Write your own function
The ability to use self-written functions makes R incredibly powerful, efficient, convenient and elegant. Once you learn how to write your own functions, programming with R will be even more comfortable and productive (Venables, Smith, and R Core Team 2020).
Creating a function in R is basically the assignment of a function object to a variable. That’s why you will use the assignment operator <-
. The basic form of a function is given by (slightly altered example and text from (Douglas et al. 2020)):
With this code, we are telling R that we want to create an object of type function and that we want to give it the name NameFunction
. As input, it takes Argument1
and Argument2
and it then performs the inner part of the function which is denoted by Expression
.
As always, it’s easier to comprehend this with an example.
[1] 3.4641
[1] 4
[1] 5
You can define default argument values in your own R functions as well. Here you see an example.
my_sqrt <- function(x, print_info = TRUE) {
y <- sqrt(x)
if (print_info) {
print(paste("sqrt", x, "equals", y))
}
return(y)
}
# some calls of the function
my_sqrt(16)
[1] "sqrt 16 equals 4"
[1] 4
[1] 4
[1] "sqrt 16 equals 4"
[1] 4
R works in a vectorized way and most of R’s functions are vectorized. This means that the function will operate on all elements of a vector. Check this by calling the function my_sqrt
on an input vector.
[1] "sqrt 16 equals 4" "sqrt 25 equals 5" "sqrt 36 equals 6"
[1] 4 5 6
With this simple self-written function, we can also illustrated why functions can be called ‘stupid’ and why we should always include checks for the arguments of your self-written function. With regard to this issue, (Douglas et al. 2020) gives an accurate description of what it means to be a programmer:
Remember two things: the intelligence of code comes from the coder, not the computer and functions need exact instructions to work.
Warning in sqrt(x): NaNs produced
[1] "sqrt -1 equals NaN"
[1] NaN
We can fix this issue by adding some checks to our self-written function. It’s up to you to decide on the checks that you add to your function and also what action will be taken. With the function stop()
the code will be stopped and with the function warning()
the function will proceed, but a warning will be given when using the function.
my_sqrt <- function(x, print_info = TRUE) {
if(!is.complex(x)) {
if(x < 0) {
stop("Computer says no.")
}
}
y <- sqrt(x)
if (print_info) {
print(paste("sqrt", x, "equals", y))
}
return(y)
}
my_sqrt(-1)
Error in my_sqrt(-1): Computer says no.
[1] "sqrt -9+0i equals 0+3i"
[1] 0+3i
The example also illustrates that you can specify the error/warning message yourself. Even if it’s not the case for this example, make sure that these messages are informative. Both for the user and yourself. This way, the user will know what went wrong and why the function doesn’t work. This will save you a lot of time in the future, when you will use code that you have written a long time ago.
8.5.3 The ‘…’ argument
You probably already encountered the ellipsis argument ...
in the help files of some functions (see ?plot
for example). This argument allows you to pass the arguments of your self-written function to another function (within your function).
8.5.4 Function environments (advanced)
This part is a bit more advanced, so don’t worry if you don’t fully understand everything in this section. Know that it suffices to have a general understanding of this chapter and that you can read it again once you are more familiar with writing functions. For an in-depth discussion on (function) environments, we refer the reader to the Advanced R book (Wickham 2019).
Remember the sandbox analogy? Our global environment is our big sandbox and the objects that we create in this sandbox, stay in it. When we define a function and give it a name, we create a new object in our sandbox. Just like any other object, we can call it by using its name. Keeping the sandbox analogy, we can look at a function as a mini-sandbox within our larger sandbox, just like other objects. Functions, however, are fundamentally different from other objects. Not only because of their class, but also because of how they operate and their interaction with the different environments in R.
The inner part of the function (i.e. the part in the curly braces {}
) has its own environment, called the execution environment. Further, functions have what they call a fresh start principle (Wickham 2019). This means that, every time we run a function, a new execution environment is created in which the inner part is executed. Translating this to the sandbox analogy, we start with a clean mini-sandbox every time we run the function and after the function has been completed, we throw away everything that was computed and created in this mini-sandbox.
environment:<environment: 0x0000020c93e948e0>
[1] 2
environment:<environment: 0x0000020c93e47350>
[1] 2
In this example we see that a new (execution) environment is created every time the function is called. Due to this fresh start principle, R makes sure that results are not contaminated due to previous function calls and that we get the same result when we run the function with the same arguments.
Error in f(2): object 'a' not found
[1] 6
8.6 The apply family
Whenever you’re using a for loop, you might want to revise your code and see whether you can use a member of the apply
family instead. [Quote from DataCamp’s `Intermediate R’ course]
These functions may be a bit confusing at first, but once you get the hang of them they will be your best friends. With this family of functions, we are able to avoid loops, to run our code faster (certainly when using the parallel versions) and the risk of having an error is considerably lower. As (Douglas et al. 2020) accurately and honestly put it: If you can, use the apply version. There’s nothing worse than realizing there was a small, tiny, seemingly meaningless mistake in a loop which weeks, months or years down the line has propagated into a huge mess. We strongly recommend trying to use the apply functions whenever possible.
Just look at the following example which is a simplification of what might happen in reality. This all has to do with our (execution) environment. In the for
loop this is our global environment and that’s why the original x
gets overwritten. Conversely, when using sapply()
we have the fresh start principle. Every ‘loop’ within sapply
starts with a ‘clean’ environment.
[1] 4
[1] 6
[1] 9
[1] 4 5 6
The sapply
function is a member of a broader group of functions, the apply family of functions:
apply
,lapply
,sapply
,tapply
,vapply
,mapply
.
Of all members, the sapply
and lapply
function are easiest to start with. The sapply
and lapply
function are nearly identical and are meant for one-dimensional inputs (i.e. a vector or a list). sapply
will try to simplify the output as much as possible, while lapply
will always return a list. Based on the example in the code block above, can you guess what the output of the following code will be?
To explain the mechanics behind lapply
more in detail, let’s use the following structure:
Here, X
is the input which is either a vector or a list and each element of X
is then passed to the function FUN
.
Garfield <- function(x) {
if(x == "Monday")
"I hate Mondays!"
else
"I'm glad it's not Monday."
}
X <- weekdays(Sys.Date() + 0:6)
lapply(X, FUN = Garfield)
[[1]]
[1] "I'm glad it's not Monday."
[[2]]
[1] "I'm glad it's not Monday."
[[3]]
[1] "I'm glad it's not Monday."
[[4]]
[1] "I'm glad it's not Monday."
[[5]]
[1] "I'm glad it's not Monday."
[[6]]
[1] "I hate Mondays!"
[[7]]
[1] "I'm glad it's not Monday."
In this example, X
is a vector that contains all days of the week and each weekday is then passed to the function Garfield
which tells you whether Garfield is happy or not. Further, the results are returned in a list-form. This is another advantage compared to a for
loop, where we would have to explicitly create a list and save all results in this list. In addition, the fact that the input to the lapply
function can also be a list, makes it even more interesting to use. The lapply
and sapply
function are incredibly powerful once you get the hang of it. Just take a look at the following example.
p <- 4
n <- 10
X <- cbind(1, replicate(p, rnorm(n)))
B <- runif(1 + p)
Y <- X %*% B + rnorm(n) / 10
Df <- data.frame(Y, X[, -1])
ModelFits <- lapply(1:p, function(x) {
Formula <- formula(paste0("Y ~ ", paste0("X", 1:x, collapse = " + ")))
lmFit <- lm(Formula, data = Df)
return(lmFit)
})
lapply(ModelFits, coef)
[[1]]
(Intercept) X1
0.24627 0.75615
[[2]]
(Intercept) X1 X2
0.53405 0.68925 0.53393
[[3]]
(Intercept) X1 X2 X3
0.55575 0.40982 0.65191 0.56139
[[4]]
(Intercept) X1 X2 X3 X4
0.56549 0.69207 0.51405 0.26160 0.60735
[[1]]
[1] 1.3588
[[2]]
[1] 1.2094
[[3]]
[1] 1.1741
[[4]]
[1] 0.09184
If you look at the help file of the lapply
function, you will notice that this also makes use of the ellipsis ...
argument and hence, we can pass arguments to the function FUN
.
[[1]]
fit lwr upr
1 -0.25841 -1.45855 0.94174
2 1.48189 -0.47391 3.43768
3 -1.43837 -3.92390 1.04716
4 0.25196 -0.73908 1.24299
5 0.54358 -0.52956 1.61672
6 0.02281 -1.01113 1.05675
7 0.73687 -0.46180 1.93554
8 0.04378 -0.98240 1.06996
9 0.23757 -0.75334 1.22848
10 0.78031 -0.45251 2.01314
[[2]]
fit lwr upr
1 -0.73429 -2.00250 0.53393
2 2.39866 0.23009 4.56723
3 -0.98511 -3.33386 1.36363
4 -1.19726 -3.34381 0.94929
5 0.72918 -0.28147 1.73983
6 0.61460 -0.61925 1.84844
7 0.59655 -0.51355 1.70665
8 0.33777 -0.67864 1.35418
9 0.59669 -0.42829 1.62167
10 0.04521 -1.45179 1.54221
[[3]]
fit lwr upr
1 -1.00645 -2.397046 0.38415
2 2.22602 0.019034 4.43301
3 -0.97028 -3.329953 1.38940
4 -0.92906 -3.154274 1.29616
5 1.19403 -0.197737 2.58580
6 0.87573 -0.474200 2.22566
7 0.52571 -0.598860 1.65028
8 -0.62319 -2.840282 1.59389
9 1.31519 -0.480728 3.11110
10 -0.20571 -1.794928 1.38350
[[4]]
fit lwr upr
1 -2.883295 -3.075433 -2.691158
2 2.757025 2.570474 2.943575
3 0.258924 0.040215 0.477633
4 -0.216281 -0.408317 -0.024245
5 0.329663 0.194976 0.464350
6 -0.060252 -0.195305 0.074801
7 1.373749 1.257944 1.489555
8 -0.621892 -0.804081 -0.439702
9 1.488533 1.340266 1.636801
10 -0.024186 -0.155632 0.107260
If we would rather have the results returned as a vector, matrix or array, we can make use of the sapply
function.
[1] 1.35885 1.20943 1.17410 0.09184
Another advantage of using sapply
is that it will, by default, return a named object when the input X
has a names
attribute.
$names
[1] "Model1" "Model2" "Model3" "Model4"
Model1 Model2 Model3 Model4
1.35885 1.20943 1.17410 0.09184
This default behavior of simplifying the output can be changed by setting the argument simplify = FALSE
.
$Model1
[1] 1.3588
$Model2
[1] 1.2094
$Model3
[1] 1.1741
$Model4
[1] 0.09184
Another very famous member of this family is the apply
function and is meant for two- or higher-dimensional objects.
- It must be applied on an array and is also applicable to a matrix and a data.frame.
- It takes the following arguments:
- first argument: the array you are working with
- second argument: margin to apply the function over (1 for rows and 2 for columns)
- third argument: function you want to apply.
- fourth argument: ellipsis argument, used to pass arguments to the applied function
Let’s start by illustrating it with a very simple example.
[1] 12 15 18
[1] 6 15 24
[1] 12 NA 18
[1] 12 13 18
What apply
does, is that is basically ‘slices’ the input object according to the margin specified and then passes each ‘slice’ to the user-specified function. That’s why it’s related to the lapply
and sapply
function. Do you know why we get the same result for apply
and sapply
in the following block of code?
V1 V2 V3
4 15 24
V1 V2 V3
4 15 24
That’s right! Since a data.frame is a list
type of object, we can either use apply
or lapply
on a data.frame when we want to use the columns as the input for FUN
.
Another useful member of this family is the tapply
function, which you already encountered in Chapter 4. The tapply
function is useful when we need to break up a vector into groups defined by some classifying factor, compute a function on the subsets, and return the results in a convenient form.
wages <- c(5500, 3500, 6500, 7500)
gender <- c("F", "F", "M", "M")
region <- c("North", "South", "North", "South")
salary <- data.frame(wages, gender, region)
tapply(salary$wages, salary$gender, mean)
F M
4500 7000
North South
F 5500 3500
M 6500 7500
The remaining two members, vapply
and mapply
, are for more advanced users and hence, we will not discuss them in detail. For now, it suffices to know that vapply
is similar to sapply
, but that it has a pre-specified type of return value and that mapply
is a multivariate version of sapply
.
8.6.1 Taking lapply/sapply to the next level
We saw that the input for sapply
/lapply
can also be a list. What if the list contained functions? Would this really work? In general, no matter how crazy your idea is for R, if you can translate it to code it will work.
[[1]]
[1] 4.973
[[2]]
[1] 2.2368
[[3]]
[1] 5
[[4]]
[1] 3
This example again shows how flexible and awesome R is!
8.7 Exercises
Learning check
Create a function that will return the sum of 2 integers.
The function
var
in R calculates the unbiased variance estimator, given a random sample of data. Write a functionvariance
which returns the biased or unbiased estimate of the variance, depending on the value of the argumentbias
which can beTRUE
orFALSE
. By default the functionvariance
should produce the same result asvar
. Formulas: unbiased = \(\frac{1}{n-1}\sum_i (x_i-\bar{x})^2\) and biased = \(\frac{1}{n}\sum_i(x_i-\bar{x})^2\) where \(\bar{x}=\frac{1}{n} \sum_i x_i\).Create a function that given a vector and an integer will return how many times the integer appears inside the vector.
Create a function that given a vector will print by screen the mean and the standard deviation, it will optionally also print the median.
Start by running the code
Avg <- ceiling(runif(5) * 10)
.
- Use
lapply
with the vectorAvg
and the functionrnorm
withn = 1000
to create a list. Each element ofAvg
should be passed to the argumentmean
of the functionrnorm
. - Compute the mean of each element in this list using either
lapply
orsapply
. - Compute the standard deviation as well.