2 Data exploration
In this tutorial you import a dataset and perform an exploratory analysis of the variables. You will visualize the data with ggplot and calculate basic statistics such as the expected value and variance of each variable.
2.1 Importing data in R
The data for this tutorial is stored in the file CPS1985.txt, which is stored in a local subdirectory data
. Check out chapter 4 of Data science in insurance: an R intro for a detailed overview of the R methods for importing data.
2.1.1 Determining the file path
Before we can import the data, we have to identify the file location. The file location can be specified by an absolute path or a relative path.
absolute_path <- "C:\\Users\\u0095171\\Dropbox\\Verzekeringen niet-leven\\Bookdown\\data\\CPS1985.txt"
relative_path <- "data\\CPS1985.txt"
Directories in the file path can be separated by a forward slash (/
) or a double backward slash (\\
).
Relative paths in R always start from your working directory. The command getwd()
returns your current working directory.
[1] "C:/Users/u0095171/Dropbox/Research AFI/R software/Bookdown/LossModels"
setwd()
specifies the working directory for the current R session. Once set, all files in the working directory can be referenced by relative paths.
# This is the map containing all files for the tutorial
setwd("C:\\Users\\u0095171\\Dropbox\\Verzekeringen niet-leven\\Bookdown\\")
# This is a relative path from the working directory to the file we want to import
path <- "data\\CPS1985.txt"
To make sure that the file is located in this directory, we can check it using the command file.exists()
.
[1] TRUE
It is often convenient to set the working directory to the location of the active R file. In Rstudio the following code retrieves the directory of the active document.
2.1.2 Organizing your analysis
CheckDir <- function(x) if(!dir.exists(x)) dir.create(x)
pathDir = path.expand("~/AnalysisLossModels/")
pathData = paste0(pathDir, "Data/")
pathFigs = paste0(pathDir, "Figures/")
pathRes = paste0(pathDir, "Results/")
sapply(list(pathDir, pathData, pathFigs, pathRes), CheckDir)
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[1] FALSE
[1] "Data" "Figures" "Results"
[1] "Data/CPS1985.txt"
2.1.3 Import a .txt file
read.table()
is the most basic import function in R.
You can specify tons of different arguments in this function (See ?read.table
).
We import the dataset CPS1985.txt
in a variable CPS
.
After importing a dataset, you should always inspect the data to confirm that the data was imported correctly. head
prints the first records of the dataset.
wage education experience age ethnicity region gender occupation
1 5.10 8 21 35 hispanic other female worker
2 6.67 12 1 19 cauc other male worker
3 4.00 12 4 22 cauc other male worker
4 7.50 12 17 35 cauc other male worker
5 13.07 13 9 28 cauc other male worker
6 4.45 10 27 43 cauc south male worker
sector union married
1 manufacturing no yes
2 manufacturing no no
3 other no no
4 other no yes
5 other yes no
6 other no no
Some more data characteristics:
[1] 533
[1] "wage" "education" "experience" "age" "ethnicity"
[6] "region" "gender" "occupation" "sector" "union"
[11] "married"
2.2 Data exploration
We continue to explore the variables in the CPS
dataset with a focus on wage
and sector
.
2.2.1 variable types
The same data can often be stored in different formats. The function class
returns the type of an R object.
Use the function class
to
- Determine the class of the dataset CPS;
- Determine the class of the variables
wage
andsector
in the dataset CPS.
[1] "data.frame"
[1] "numeric"
[1] "character"
Within the tidyverse
the class of all variables in a dataset can be determined using map_chr(<dataset>, class)
.
The function map(<dataset>, <function>)
applies <function>
to each of the columns of <dataset>
. We can use map_chr
when the output of <function>
is of type character
.
wage education experience age ethnicity region
"numeric" "integer" "integer" "integer" "character" "character"
gender occupation sector union married
"character" "character" "character" "character" "character"
2.2.2 Factor variables
factor
is the R-datatype for unordened categorical variables.
Some usefull functions for factor
variables are:
levels
: shows the possible outcomes.table
: shows the frequency of the possible outcomes.
NULL
construction manufacturing other
24 98 411
Factors are type-safe, that is you will get a warning when you try to insert a value that is not in levels(<factor variable>)
.
[1] "insurance" "manufacturing" "other" "other"
[5] "other" "other"
test
construction insurance manufacturing other
24 1 97 411
NA
values are an indication that some calculation went wrong. Notice that the NA
value is not shown in the output of table
. To obtain the number of NA
observation use sum(is.na(<dataset>))
.
[1] 0
To insert a new value in a factor
variable, we first adapt the set of allowed outcomes.
test
construction insurance manufacturing other
24 1 97 411
test
construction insurance manufacturing other
24 1 97 411
The following code applies table
to all columns of the dataset CPS
that are of class factor.
require(tidyverse)
CPS %>%
keep(is.factor) %>% # Select all columns of class factor
map(table) # apply the table function to these columns
named list()
In ggplot
, geom_bar
is used to construct barplots. Check out the online documentation for examples.
Adding fill = <second factor variable>
visualizes the interaction between two factor variables.
The default behavior is to stack bars. Adding position = position_dodge()
shows bars of different color side-by-side.
The following code visualizes the distribution of sector by gender.
2.2.3 Numeric variables
The variable wage
in the dataset CPS
is numeric.
[1] "numeric"
A good strategy is to start by visualizing numeric variables in a density plot.
Adding theme_bw()
selects the black and white theme for ggplot figures.
The density shows that most wages are between 0 and 30, but there is an outlier with a wage of more than 40.
stat_cdf
visualizes the empirical cdf.
mean
and sd
calculate the empirical mean and standard deviation of a numeric variable.
[1] 9.032 5.141
You can also calulate the mean and standard deviation using the formulas \[\mu = E(X) = \frac{1}{n}\sum_{i=1}^n x\] and \[\sigma = \sqrt{E((X-\mu)^2 } = \sqrt{\frac{1}{n}\sum_{i=1}^n (x-\mu)^2}\]
[1] 9.032
[1] 5.136
There is a slight difference between our outcome and the result of sd(.)
, since the latter divides by n-1
instead of n
.
We test whether the normal distribution is a good approximation for CPS$wage
. For this we visualize the density of CPS$wage
and its normal approximation in a single graph.
In order to visualize the normal approximation we simulate 1000 observations from a normal distribution with the same mean and standard deviation as CPS$wage
.
rnorm
generates observations from a normal distribution.
ggplot() +
theme_bw() +
geom_density(aes(CPS$wage, fill = 'original data'), alpha = .5) +
geom_density(aes(normal_approx, fill = 'normal approximation'), alpha = .5)
The shape of these densities is quite different and we conclude that the normal distribution is not a good approximation for CPS$wage
. One of the reasons for this poor approximation is that the normal distribution is defined on \([-\infty, \infty]\), whereas wage has to be positive.
We transform the outcomes of CPS$wage
by taking the logarithm.
Exercise:
Check whether the normal distribution is a good approximation for log(CPS$wage)
.
mu = mean(log_wage);
sigma = sd(log_wage);
normal_approx <- rnorm(1000, mean = mu, sd = sigma)
ggplot() +
theme_bw() +
geom_density(aes(log(CPS$wage), fill = 'log_wage'), alpha = .5) +
geom_density(aes(normal_approx, fill = 'normal approximation'), alpha = .5)
The approximation is better, but the normal distribution has fatter tails than the log(wage) distribution.