6 Data wrangling in R
For advanced, and fast, data handling with large R objects and lots of flexibility, two lines of work are available:
- the RStudio line (with Hadley Wickham) offering the packages from the
tidyverse
: see tidyverse - the
data.table
line developed by Matt Dowle, see e.g. DataCamp’s course ondata.table
.
Both have a very specific syntax, with a demanding learning curve.
6.1 Ideas from the tidyverse
6.1.1 A tibble
instead of a data.frame
Within the tidyverse
tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors). (Quote from tibble vignette) You can use tibble
to create a new tibble and as_tibble
transforms an object (e.g. a data frame) into a tibble
.
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ... with 53,930 more rows
6.1.2 Pipes in R
Read the story behind the pipe operator in R in this tutorial from DataCamp pipes in R. In R, the pipe operator is %>%
. You can think of this operator as being similar to the +
in a ggplot2
statement, as introduced in Chapter 5. It takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”.
6.1.3 Filter observations using filter
Here is a first example of using the pipe in R.
Attaching package: 'dplyr'
The following object is masked from 'package:car':
recode
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# A tibble: 21,551 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63
# ... with 21,541 more rows
The code chunk above will translate to something like “you take the diamonds
data, then you subset the data”.
This is one of the most powerful things about the tidyverse. In fact, having a standardized chain of processing actions is called “a pipeline”.
Here is another example where you now filter diamonds based on two characteristics.
# A tibble: 3,903 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.26 Ideal E VVS2 62.9 58 554 4.02 4.06 2.54
3 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
4 0.59 Ideal E VVS2 62 55 2761 5.38 5.43 3.35
5 0.74 Ideal E SI2 62.2 56 2761 5.8 5.84 3.62
6 0.7 Ideal E VS2 60.7 58 2762 5.73 5.76 3.49
7 0.74 Ideal E SI1 62.3 54 2762 5.8 5.83 3.62
8 0.7 Ideal E SI1 60.9 57 2768 5.73 5.76 3.5
9 0.6 Ideal E VS1 61.7 55 2774 5.41 5.44 3.35
10 0.7 Ideal E SI1 62.7 55 2774 5.68 5.74 3.58
# ... with 3,893 more rows
# A tibble: 6,737 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
3 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
4 0.26 Ideal E VVS2 62.9 58 554 4.02 4.06 2.54
5 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
6 0.59 Ideal E VVS2 62 55 2761 5.38 5.43 3.35
7 0.74 Ideal E SI2 62.2 56 2761 5.8 5.84 3.62
8 0.7 Ideal E VS2 60.7 58 2762 5.73 5.76 3.49
9 0.71 Ideal D SI2 62.3 56 2762 5.73 5.69 3.56
10 0.74 Ideal E SI1 62.3 54 2762 5.8 5.83 3.62
# ... with 6,727 more rows
6.1.4 Summarize variables using summarize
The code chunk below will translate to something like “you take the diamonds
data, then you subset the data and then you calculate mean and standard deviation of these data”.
# A tibble: 1 x 2
mean std_dev
<dbl> <dbl>
1 3458. 3808.
6.1.5 Summarize based on groupings of another variable
So, here is what you’d like to do.
cut price
1 Fair 4358.8
2 Good 3928.9
3 Very Good 3981.6
4 Premium 4584.8
5 Ideal 3457.5
How can you do this with the pipe?
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 2
cut mean
<ord> <dbl>
1 Fair 4359.
2 Good 3929.
3 Very Good NA
4 Premium NA
5 Ideal 3458.
Now you want to group by multiple variables.
`summarise()` regrouping output by 'cut' (override with `.groups` argument)
# A tibble: 35 x 3
# Groups: cut [5]
cut color price
<ord> <ord> <dbl>
1 Fair D 4291.
2 Fair E 3682.
3 Fair F 3827.
4 Fair G 4239.
5 Fair H 5136.
6 Fair I 4685.
7 Fair J 4976.
8 Good D 3405.
9 Good E 3424.
10 Good F 3496.
# ... with 25 more rows
Now you want to calculate multiple metrics.
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 3
cut price carat
<ord> <dbl> <dbl>
1 Fair 4359. 1.05
2 Good 3929. 0.849
3 Very Good NA 0.806
4 Premium NA 0.892
5 Ideal 3458. 0.703
And finally, multiple metrics and multiple grouping variables.
`summarise()` regrouping output by 'cut' (override with `.groups` argument)
# A tibble: 35 x 4
# Groups: cut [5]
cut color price carat
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 0.920
2 Fair E 3682. 0.857
3 Fair F 3827. 0.905
4 Fair G 4239. 1.02
5 Fair H 5136. 1.22
6 Fair I 4685. 1.20
7 Fair J 4976. 1.34
8 Good D 3405. 0.745
9 Good E 3424. 0.745
10 Good F 3496. 0.776
# ... with 25 more rows
6.1.6 Joining tibbles
Now you want to add the mean price and mean carat per cut
to the original tibble. You use the variable cut
as the key to identify observations.
`summarise()` ungrouping output (override with `.groups` argument)
6.2 Data science the data.table
way
6.2.1 Speed junkies love data.table
data.table
is a package designed for speed junkies. “The R data.table
package is rapidly making its name as the number one choice for handling large datasets in R.” It extends and exchanges the functionality of the basic data.frame
in R. The syntax is different and you’ll have to get used to it. A data.table
cheat sheet is available here.
6.2.2 What is a data.table
?
Here you see some basic illustrations with the diamonds
data.
data.table 1.13.2 using 6 threads (see ?getDTthreads). Latest news: r-datatable.com
Attaching package: 'data.table'
The following objects are masked from 'package:dplyr':
between, first, last
tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
carat cut color clarity depth table price x y z
1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
---
53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
Fair Good Very Good Premium Ideal
1610 4906 12082 13791 21551
6.2.3 Identify keys
Instead of using subset
from the base
R, you will use the setkey
to extract the observations you want to have.
# key is used to index the data.table and will provide the extra speed
setkey(diamonds_DT, cut)
tables()
NAME NROW NCOL MB COLS KEY
1: diamonds_DT 53,940 10 3 carat,cut,color,clarity,depth,table,... cut
Total: 3MB
carat cut color clarity depth table price x y z
1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
3: 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
4: 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
5: 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
---
21547: 0.79 Ideal I SI1 61.6 56 2756 5.95 5.97 3.67
21548: 0.71 Ideal E SI1 61.9 56 2756 5.71 5.73 3.54
21549: 0.71 Ideal G VS1 61.4 56 2756 5.76 5.73 3.53
21550: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
21551: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
NAME NROW NCOL MB COLS KEY
1: diamonds_DT 53,940 10 3 carat,cut,color,clarity,depth,table,... cut,color
Total: 3MB
carat cut color clarity depth table price x y z
1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2: 0.26 Ideal E VVS2 62.9 58 554 4.02 4.06 2.54
3: 0.70 Ideal E SI1 62.5 57 2757 5.70 5.72 3.57
4: 0.59 Ideal E VVS2 62.0 55 2761 5.38 5.43 3.35
5: 0.74 Ideal E SI2 62.2 56 2761 5.80 5.84 3.62
---
3899: 0.70 Ideal E SI1 61.7 55 2745 5.71 5.74 3.53
3900: 0.51 Ideal E VVS1 61.9 54 2745 5.17 5.11 3.18
3901: 0.56 Ideal E VVS1 62.1 56 2750 5.28 5.29 3.28
3902: 0.77 Ideal E SI2 62.1 56 2753 5.84 5.86 3.63
3903: 0.71 Ideal E SI1 61.9 56 2756 5.71 5.73 3.54
carat cut color clarity depth table price x y z
1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2: 0.26 Ideal E VVS2 62.9 58 554 4.02 4.06 2.54
3: 0.70 Ideal E SI1 62.5 57 2757 5.70 5.72 3.57
4: 0.59 Ideal E VVS2 62.0 55 2761 5.38 5.43 3.35
5: 0.74 Ideal E SI2 62.2 56 2761 5.80 5.84 3.62
---
6733: 0.51 Ideal D VVS2 61.7 56 2742 5.16 5.14 3.18
6734: 0.51 Ideal D VVS2 61.3 57 2742 5.17 5.14 3.16
6735: 0.81 Ideal D SI1 61.5 57 2748 6.00 6.03 3.70
6736: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
6737: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
# what would be the alternative with base R?
subset(diamonds, diamonds$cut == "Ideal" && diamonds$color == c("E", "D"))
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ... with 53,930 more rows
6.2.4 Alternative and faster ways to aggregate
Instead of using aggregate
from the base
R, you will identify the by
variable(s).
cut price
1 Fair 4358.8
2 Good 3928.9
3 Very Good 3981.6
4 Premium 4584.8
5 Ideal 3457.5
user system elapsed
0.02 0.00 0.02
# aggregation with data.table
# will go faster thanks to indexing
diamonds_DT[ , mean(price), by=cut]
cut V1
1: Fair 4358.8
2: Good 3928.9
3: Very Good NA
4: Premium NA
5: Ideal 3457.5
user system elapsed
0 0 0
cut price
1: Fair 4358.8
2: Good 3928.9
3: Very Good NA
4: Premium NA
5: Ideal 3457.5
cut color V1
1: Fair D 4291.1
2: Fair E 3682.3
3: Fair F 3827.0
4: Fair G 4239.3
5: Fair H 5135.7
6: Fair I 4685.4
7: Fair J 4975.7
8: Good D 3405.4
9: Good E 3423.6
10: Good F 3495.8
11: Good G 4123.5
12: Good H 4276.3
13: Good I 5078.5
14: Good J 4574.2
15: Very Good D 3470.5
16: Very Good E 3214.7
17: Very Good F NA
18: Very Good G 3872.8
19: Very Good H 4535.4
20: Very Good I 5255.9
21: Very Good J 5103.5
22: Premium D 3631.3
23: Premium E NA
24: Premium F NA
25: Premium G 4500.7
26: Premium H 5216.7
27: Premium I 5946.2
28: Premium J 6294.6
29: Ideal D 2629.1
30: Ideal E 2597.6
31: Ideal F 3374.9
32: Ideal G 3720.7
33: Ideal H 3889.3
34: Ideal I 4452.0
35: Ideal J 4918.2
cut color V1
# aggregate multiple arguments
diamonds_DT[ , list(price = mean(price), carat = mean(carat)), by = cut]
cut price carat
1: Fair 4358.8 1.04614
2: Good 3928.9 0.84918
3: Very Good NA 0.80638
4: Premium NA 0.89195
5: Ideal 3457.5 0.70284
cut price carat caratSum
1: Fair 4358.8 1.04614 1684.3
2: Good 3928.9 0.84918 4166.1
3: Very Good NA 0.80638 9742.7
4: Premium NA 0.89195 12301.0
5: Ideal 3457.5 0.70284 15146.8
# multiple metrics and multiple grouping variables
diamonds_DT[ , list(price = mean(price), carat = mean(carat)), by = list(cut, color)]
cut color price carat
1: Fair D 4291.1 0.92012
2: Fair E 3682.3 0.85661
3: Fair F 3827.0 0.90471
4: Fair G 4239.3 1.02382
5: Fair H 5135.7 1.21917
6: Fair I 4685.4 1.19806
7: Fair J 4975.7 1.34118
8: Good D 3405.4 0.74452
9: Good E 3423.6 0.74513
10: Good F 3495.8 0.77593
11: Good G 4123.5 0.85090
12: Good H 4276.3 0.91473
13: Good I 5078.5 1.05722
14: Good J 4574.2 1.09954
15: Very Good D 3470.5 0.69642
16: Very Good E 3214.7 0.67632
17: Very Good F NA 0.74096
18: Very Good G 3872.8 0.76680
19: Very Good H 4535.4 0.91595
20: Very Good I 5255.9 1.04695
21: Very Good J 5103.5 1.13322
22: Premium D 3631.3 0.72155
23: Premium E NA 0.71774
24: Premium F NA 0.82704
25: Premium G 4500.7 0.84149
26: Premium H 5216.7 1.01645
27: Premium I 5946.2 1.14494
28: Premium J 6294.6 1.29309
29: Ideal D 2629.1 0.56577
30: Ideal E 2597.6 0.57840
31: Ideal F 3374.9 0.65583
32: Ideal G 3720.7 0.70071
33: Ideal H 3889.3 0.79952
34: Ideal I 4452.0 0.91303
35: Ideal J 4918.2 1.06359
cut color price carat
6.2.5 Joining data.tables
How to join data.tables
?
# join two data.tables
d <- diamonds_DT[ , list(price = mean(price), carat = mean(carat)), by = cut]
d
cut price carat
1: Fair 4358.8 1.04614
2: Good 3928.9 0.84918
3: Very Good NA 0.80638
4: Premium NA 0.89195
5: Ideal 3457.5 0.70284
carat cut color clarity depth table price x y z i.price
1: 0.75 Fair D SI2 64.6 57 2848 5.74 5.72 3.70 4358.8
2: 0.71 Fair D VS2 56.9 65 2858 5.89 5.84 3.34 4358.8
3: 0.90 Fair D SI2 66.9 57 2885 6.02 5.90 3.99 4358.8
4: 1.00 Fair D SI2 69.3 58 2974 5.96 5.87 4.10 4358.8
5: 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05 4358.8
---
53936: 0.71 Ideal J SI1 60.6 57 2700 5.78 5.83 3.52 3457.5
53937: 0.81 Ideal J VS2 62.1 56 2708 5.92 5.97 3.69 3457.5
53938: 0.84 Ideal J VS2 61.1 57 2709 6.09 6.12 3.73 3457.5
53939: 0.82 Ideal J VS2 61.6 56 2741 6.00 6.04 3.71 3457.5
53940: 0.83 Ideal J VS2 62.3 55 2742 6.01 6.03 3.75 3457.5
i.carat
1: 1.04614
2: 1.04614
3: 1.04614
4: 1.04614
5: 1.04614
---
53936: 0.70284
53937: 0.70284
53938: 0.70284
53939: 0.70284
53940: 0.70284
6.3 Exercises
Learning check
- (An exercise taken from (Kleiber and Zeileis 2008)) “PARADE” is the Sunday newspaper magazine supplementing the Sunday or weekend edition of some 500 daily newspapers in the United States of America. An important
yearly feature is an article providing information on some 120150 “randomly” selected
US citizens, indicating their profession, hometown and state, and their yearly earnings.
The Parade2005 (in library AER) data contain the 2005 version, amended by a variable
indicating celebrity status (motivated by substantial oversampling of celebrities in these data).
For the Parade2005 data and by using
%>%
answer the following questions.
- Load the data
Parade2005
from theAER
package, usedata("Parade2005")
to make the data accessible. - Determine the mean earnings in California.
- Determine the number of individuals residing in Idaho.
- Determine the mean and the median earnings of celebrities.