---
title: "Grouping"
format: html
editor: source
---
```{r, include = FALSE}
source("setup.R")
```
A very common task is to split a data frame into subgroups, operate somehow on each group, and then bind the results back into one data frame: `split-do-bind`.
Let's do something simple with the flight data: group by `origin` and `carrier`, then compute the mean `dep_delay` for each group.
## data.frame
The classic way to do this involves the `split`, `lapply`, and `rbind` functions (with help from `do.call`). To split on two variables at once we need to make a character (or factor) vector with compound meaning, `origin_carrier`, and split on that.
```{r}
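df = read_flights(type = "data.frame")
# build a compound key so split() can group on both variables at once
o_c = paste(df$origin, df$carrier, sep = "_")
df = split(df, o_c) |>
  lapply(
    function(grp){
      # reduce each group to a single summary row
      data.frame(origin = grp$origin[1],
                 carrier = grp$carrier[1],
                 mean_dep_delay = mean(grp$dep_delay, na.rm = TRUE))
    }
  )
# bind the one-row results back into a single data frame
df = do.call(rbind, df)
df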
```
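
To see the three steps separately, here is a minimal sketch on a small made-up data frame. The `toy_df` data and its values are invented for illustration and are not part of the flight data. It also shows `aggregate()`, a base R shortcut that does the splitting, summarizing, and binding in one call.

```{r}
# Hypothetical toy data standing in for the flights table
toy_df = data.frame(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA"),
  carrier   = c("AA", "AA", "UA", "AA", "AA"),
  dep_delay = c(10, NA, 4, -2, 6)
)

# split: one data frame per origin_carrier combination
toy_key = paste(toy_df$origin, toy_df$carrier, sep = "_")
toy_groups = split(toy_df, toy_key)
names(toy_groups)

# do: reduce each group to a one-row data frame
toy_rows = lapply(toy_groups, function(grp){
  data.frame(origin = grp$origin[1],
             carrier = grp$carrier[1],
             mean_dep_delay = mean(grp$dep_delay, na.rm = TRUE))
})

# bind: stack the one-row results
toy_means = do.call(rbind, toy_rows)
toy_means

# One-call alternative; na.action = na.pass keeps rows with NA so that
# na.rm = TRUE inside mean() is what handles them
toy_agg = aggregate(dep_delay ~ origin + carrier, data = toy_df,
                    FUN = mean, na.rm = TRUE, na.action = na.pass)
toy_agg
```

Note that `aggregate()` names the result column after the input column (`dep_delay`), so a rename is still needed to match the output above.
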
## tibble
Here we simply tag the tibble with the desired grouping. Our first attempt is to apply an anonymous function to each group using `group_map`.
```{r}
tbl = read_flights(type = "tibble")
tbl = dplyr::group_by(tbl, origin, carrier) |>
  dplyr::group_map(
    function(grp, key){
      key |>
        dplyr::mutate(mean_dep_delay = mean(grp$dep_delay, na.rm = TRUE))
    }
  ) |>
  dplyr::bind_rows()
tbl
```
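
In `group_map()`, the function is called once per group with two arguments: the group's rows without the grouping columns (`grp` here) and a one-row tibble holding that group's key values (`key`). That is why the summary is attached to `key` above. As a sketch of a closely related option, `dplyr::group_modify()` prepends the key columns and binds the rows for us, so the function only has to return the new column. The `toy_tbl` data below is made up for illustration.

```{r}
# Hypothetical toy data; values are made up for illustration
toy_tbl = dplyr::tibble(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA"),
  carrier   = c("AA", "AA", "UA", "AA", "AA"),
  dep_delay = c(10, NA, 4, -2, 6)
)

# group_modify() calls the function once per group, prepends the key
# columns to each result, and row-binds everything
toy_modified = dplyr::group_by(toy_tbl, origin, carrier) |>
  dplyr::group_modify(
    function(grp, key){
      dplyr::tibble(mean_dep_delay = mean(grp$dep_delay, na.rm = TRUE))
    }
  ) |>
  dplyr::ungroup()
toy_modified
```
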
But keep in mind that many of the `dplyr` functions are "group-aware", so we could simplify...
```{r}
tbl = read_flights(type = "tibble")
tbl = dplyr::group_by(tbl, origin, carrier) |>
  dplyr::mutate(mean_dep_delay = mean(.data$dep_delay, na.rm = TRUE)) |>
  dplyr::slice(1) |>
  dplyr::select(origin, carrier, mean_dep_delay) |>
  dplyr::ungroup()
tbl
```
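
When the goal is one value per group, `dplyr::summarise()` collapses each group directly, so the `mutate()`, `slice(1)`, `select()` sequence can be replaced by a single step. A minimal sketch on made-up data (the `toy_flights` tibble is hypothetical, not the flight data):

```{r}
# Hypothetical toy data; values are made up for illustration
toy_flights = dplyr::tibble(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA"),
  carrier   = c("AA", "AA", "UA", "AA", "AA"),
  dep_delay = c(10, NA, 4, -2, 6)
)

# summarise() returns one row per group; .groups = "drop" removes the
# grouping afterwards, so no separate ungroup() is needed
toy_summary = dplyr::group_by(toy_flights, origin, carrier) |>
  dplyr::summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE),
                   .groups = "drop")
toy_summary
```
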
## data.table
This is quite different. I know enough `data.table` to get this working, but not enough to explain why this step is written this particular way.
```{r}
dt = read_flights(type = "data.table")
dt[,
   mean(.SD[[1]], na.rm = TRUE),
   by = .(origin, carrier),
   .SDcols = "dep_delay"]
```
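
A short unpacking of that call may help. Within each `by` group, `.SD` is the subset of the data restricted to the columns named in `.SDcols`, and `.SD[[1]]` pulls out its first (here, only) column. As a sketch on made-up data (`toy_dt` is hypothetical), the column can also be named directly in `j`, which gives the same numbers and a labeled result column instead of the default `V1`.

```{r}
# Hypothetical toy data; values are made up for illustration
toy_dt = data.table::data.table(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA"),
  carrier   = c("AA", "AA", "UA", "AA", "AA"),
  dep_delay = c(10, NA, 4, -2, 6)
)

# same form as above: the result column is named V1
toy_dt_sd = toy_dt[,
                   mean(.SD[[1]], na.rm = TRUE),
                   by = .(origin, carrier),
                   .SDcols = "dep_delay"]
toy_dt_sd

# naming the column directly in j, wrapped in .() to label the output
toy_dt_means = toy_dt[,
                      .(mean_dep_delay = mean(dep_delay, na.rm = TRUE)),
                      by = .(origin, carrier)]
toy_dt_means
```
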
## tidytable
This starts out quite similar to `tibble`, but you'll note that there is no need to create an anonymous function to use within `group_map()`. Instead, the `tidytable` helper functions are "group-aware".
```{r}
tt = read_flights(type = "tidytable")
tt = tidytable::group_by(tt, origin, carrier) |>
  tidytable::mutate(mean_dep_delay = mean(.data$dep_delay, na.rm = TRUE)) |>
  tidytable::slice(1) |>
  tidytable::select(origin, carrier, mean_dep_delay) |>
  tidytable::ungroup()
tt
```
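
As with `dplyr`, a one-value-per-group result can also be produced in a single step. The sketch below assumes a `tidytable` version in which `summarise()` accepts a `.by` argument for per-call grouping; the `toy_tt` data is made up for illustration.

```{r}
# Hypothetical toy data; values are made up for illustration
toy_tt = tidytable::tidytable(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA"),
  carrier   = c("AA", "AA", "UA", "AA", "AA"),
  dep_delay = c(10, NA, 4, -2, 6)
)

# .by groups for this call only, so no group_by()/ungroup() pair is needed
toy_tt_means = tidytable::summarise(
  toy_tt,
  mean_dep_delay = mean(dep_delay, na.rm = TRUE),
  .by = c(origin, carrier)
)
toy_tt_means
```
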
If you have a more complex function to apply to each group, you can fall back on the more basic `split-do-bind` approach seen with data.frame.
```{r}
tt = read_flights(type = "tidytable")
tt = tidytable::group_by(tt, origin, carrier) |>
  tidytable::group_split() |>
  lapply(
    function(grp){
      tidytable::tidytable(
        origin = grp$origin[1],
        carrier = grp$carrier[1],
        mean_dep_delay = mean(grp$dep_delay, na.rm = TRUE)
      )
    }
  ) |>
  tidytable::bind_rows()
tt
```