EVOLUTION-MANAGER
Edit File: topic-data-mask-programming.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Data mask programming patterns</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for topic-data-mask-programming {rlang}"><tr><td>topic-data-mask-programming {rlang}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2>Data mask programming patterns</h2> <h3>Description</h3> <p><a href="topic-data-mask.html">Data-masking</a> functions require special programming patterns when used inside other functions. In this topic we'll review and compare the different patterns that can be used to solve specific problems. </p> <p>If you are a beginner, you might want to start with one of these tutorials: </p> <ul> <li> <p><a href="https://dplyr.tidyverse.org/articles/programming.html">Programming with dplyr</a> </p> </li> <li> <p><a href="https://ggplot2.tidyverse.org/articles/ggplot2-in-packages.html">Using ggplot2 in packages</a> </p> </li></ul> <p>If you'd like to go further and learn about defusing and injecting expressions, read the <a href="topic-metaprogramming.html">metaprogramming patterns topic</a>. </p> <h3>Choosing a pattern</h3> <p>Two main considerations determine which programming pattern you need to wrap a data-masking function: </p> <ol> <li><p> What behaviour does the <em>wrapped</em> function implement? </p> </li> <li><p> What behaviour should <em>your</em> function implement? </p> </li></ol> <p>Depending on the answers to these questions, you can choose between these approaches: </p> <ul> <li><p> The <strong>forwarding patterns</strong> with which your function inherits the behaviour of the function it interfaces with. </p> </li> <li><p> The <strong>name patterns</strong> with which your function takes strings or character vectors of column names. </p> </li> <li><p> The <strong>bridge patterns</strong> with which you change the behaviour of an argument instead of inheriting it. </p> </li></ul> <p>You will also need to use different solutions for single named arguments than for multiple arguments in <code>...</code>. </p> <h3>Argument behaviours</h3> <p>In a regular function, arguments can be defined in terms of a <em>type</em> of objects that they accept. An argument might accept a character vector, a data frame, a single logical value, etc. Data-masked arguments are more complex. Not only do they generally accept a specific type of objects (for instance <code>dplyr::mutate()</code> accepts vectors), they exhibit special computational behaviours. </p> <ul> <li><p> Data-masked expressions (base): E.g. <code><a href="../../base/html/transform.html">transform()</a></code>, <code><a href="../../base/html/with.html">with()</a></code>. Expressions may refer to the columns of the supplied data frame. </p> </li> <li><p> Data-masked expressions (tidy eval): E.g. <code>dplyr::mutate()</code>, <code>ggplot2::aes()</code>. Same as base data-masking but with tidy eval features enabled. This includes <a href="topic-inject.html">injection operators</a> such as <code><a href="embrace-operator.html">{{</a></code> and <code><a href="injection-operator.html">!!</a></code> and the <code><a href="dot-data.html">.data</a></code> and <code><a href="dot-data.html">.env</a></code> pronouns. </p> </li> <li><p> Data-masked symbols: Same as data-masked arguments but the supplied expressions must be simple column names. This often simplifies things, for instance this is an easy way of avoiding issues of <a href="topic-double-evaluation.html">double evaluation</a>. </p> </li> <li> <p><a href="https://tidyselect.r-lib.org/reference/language.html">Tidy selections</a>: E.g. <code>dplyr::select()</code>, <code>tidyr::pivot_longer()</code>. This is an alternative to data masking that supports selection helpers like <code>starts_with()</code> or <code>all_of()</code>, and implements special behaviour for operators like <code>c()</code>, <code>|</code> and <code>&</code>. </p> <p>Unlike data masking, tidy selection is an interpreted dialect. There is in fact no masking at all. Expressions are either interpreted in the context of the data frame (e.g. <code>c(cyl, am)</code> which stands for the union of the columns <code>cyl</code> and <code>am</code>), or evaluated in the user environment (e.g. <code>all_of()</code>, <code>starts_with()</code>, and any other expressions). This has implications for inheritance of argument behaviour as we will see below. </p> </li> <li> <p><a href="dyn-dots.html">Dynamic dots</a>: These may be data-masked arguments, tidy selections, or just regular arguments. Dynamic dots support injection of multiple arguments with the <code><a href="splice-operator.html">!!!</a></code> operator as well as name injection with <a href="glue-operators.html">glue</a> operators. </p> </li></ul> <p>To let users know about the capabilities of your function arguments, document them with the following tags, depending on which set of semantics they inherit from: </p> <div class="sourceCode"><pre>@param foo <[`data-masked`][dplyr::dplyr_data_masking]> What `foo` does. @param bar <[`tidy-select`][dplyr::dplyr_tidy_select]> What `bar` does. @param ... <[`dynamic-dots`][rlang::dyn-dots]> What these dots do. </pre></div> <h3>Forwarding patterns</h3> <p>With the forwarding patterns, arguments inherit the behaviour of the data-masked arguments they are passed in. </p> <h4>Embrace with <code style="white-space: pre;">{{</code></h4> <p>The embrace operator <code><a href="embrace-operator.html">{{</a></code> is a forwarding syntax for single arguments. You can forward an argument in data-masked context: </p> <div class="sourceCode r"><pre>my_summarise <- function(data, var) { data %>% dplyr::summarise({{ var }}) } </pre></div> <p>Or in tidyselections: </p> <div class="sourceCode r"><pre>my_pivot_longer <- function(data, var) { data %>% tidyr::pivot_longer(cols = {{ var }}) } </pre></div> <p>The function automatically inherits the behaviour of the surrounding context. For instance arguments forwarded to a data-masked context may refer to columns or use the <code><a href="dot-data.html">.data</a></code> pronoun: </p> <div class="sourceCode r"><pre>mtcars %>% my_summarise(mean(cyl)) x <- "cyl" mtcars %>% my_summarise(mean(.data[[x]])) </pre></div> <p>And arguments forwarded to a tidy selection may use all tidyselect features: </p> <div class="sourceCode r"><pre>mtcars %>% my_pivot_longer(cyl) mtcars %>% my_pivot_longer(vs:gear) mtcars %>% my_pivot_longer(starts_with("c")) x <- c("cyl", "am") mtcars %>% my_pivot_longer(all_of(x)) </pre></div> <h4>Forward <code>...</code></h4> <p>Simple forwarding of <code>...</code> arguments does not require any special syntax since dots are already a forwarding syntax. Just pass them to another function like you normally would. This works with data-masked arguments: </p> <div class="sourceCode r"><pre>my_group_by <- function(.data, ...) { .data %>% dplyr::group_by(...) } mtcars %>% my_group_by(cyl = cyl * 100, am) </pre></div> <p>As well as tidy selections: </p> <div class="sourceCode r"><pre>my_select <- function(.data, ...) { .data %>% dplyr::select(...) } mtcars %>% my_select(starts_with("c"), vs:carb) </pre></div> <p>Some functions take a tidy selection in a single named argument. In that case, pass the <code>...</code> inside <code>c()</code>: </p> <div class="sourceCode r"><pre>my_pivot_longer <- function(.data, ...) { .data %>% tidyr::pivot_longer(c(...)) } mtcars %>% my_pivot_longer(starts_with("c"), vs:carb) </pre></div> <p>Inside a tidy selection, <code>c()</code> is not a vector concatenator but a selection combinator. This makes it handy to interface between functions that take <code>...</code> and functions that take a single argument. </p> <h3>Names patterns</h3> <p>With the names patterns you refer to columns by name with strings or character vectors stored in env-variables. Whereas the forwarding patterns are exclusively used within a function to pass <em>arguments</em>, the names patterns can be used anywhere. </p> <ul> <li><p> In a script, you can loop over a character vector with <code>for</code> or <code>lapply()</code> and use the <code><a href="dot-data.html">.data</a></code> pattern to connect a name to its data-variable. A vector can also be supplied all at once to the tidy select helper <code>all_of()</code>. </p> </li> <li><p> In a function, using the names patterns on function arguments lets users supply regular data-variable names without any of the complications that come with data-masking. </p> </li></ul> <h4>Subsetting the <code>.data</code> pronoun</h4> <p>The <code><a href="dot-data.html">.data</a></code> pronoun is a tidy eval feature that is enabled in all data-masked arguments, just like <code><a href="embrace-operator.html">{{</a></code>. The pronoun represents the data mask and can be subsetted with <code>[[</code> and <code>$</code>. These three statements are equivalent: </p> <div class="sourceCode r"><pre>mtcars %>% dplyr::summarise(mean = mean(cyl)) mtcars %>% dplyr::summarise(mean = mean(.data$cyl)) var <- "cyl" mtcars %>% dplyr::summarise(mean = mean(.data[[var]])) </pre></div> <p>The <code>.data</code> pronoun can be subsetted in loops: </p> <div class="sourceCode r"><pre>vars <- c("cyl", "am") for (var in vars) print(dplyr::summarise(mtcars, mean = mean(.data[[var]]))) #> # A tibble: 1 x 1 #> mean #> <dbl> #> 1 6.19 #> # A tibble: 1 x 1 #> mean #> <dbl> #> 1 0.406 purrr::map(vars, ~ dplyr::summarise(mtcars, mean = mean(.data[[.x]]))) #> [[1]] #> # A tibble: 1 x 1 #> mean #> <dbl> #> 1 6.19 #> #> [[2]] #> # A tibble: 1 x 1 #> mean #> <dbl> #> 1 0.406 </pre></div> <p>And it can be used to connect function arguments to a data-variable: </p> <div class="sourceCode r"><pre>my_mean <- function(data, var) { data %>% dplyr::summarise(mean = mean(.data[[var]])) } my_mean(mtcars, "cyl") #> # A tibble: 1 x 1 #> mean #> <dbl> #> 1 6.19 </pre></div> <p>With this implementation, <code>my_mean()</code> is completely insulated from data-masking behaviour and is called like an ordinary function. </p> <div class="sourceCode r"><pre># No masking am <- "cyl" my_mean(mtcars, am) #> # A tibble: 1 x 1 #> mean #> <dbl> #> 1 6.19 # Programmable my_mean(mtcars, tolower("CYL")) #> # A tibble: 1 x 1 #> mean #> <dbl> #> 1 6.19 </pre></div> <h4>Character vector of names</h4> <p>The <code>.data</code> pronoun can only be subsetted with single column names. It doesn't support single-bracket indexing: </p> <div class="sourceCode r"><pre>mtcars %>% dplyr::summarise(.data[c("cyl", "am")]) #> Error in `dplyr::summarise()`: #> ! Can't compute `..1 = .data[c("cyl", "am")]`. #> Caused by error in `.data[c("cyl", "am")]`: #> ! `[` is not supported by the `.data` pronoun, use `[[` or $ instead. </pre></div> <p>There is no plural variant of <code>.data</code> built in tidy eval. Instead, we'll used the <code>all_of()</code> operator available in tidy selections to supply character vectors. This is straightforward in functions that take tidy selections, like <code>tidyr::pivot_longer()</code>: </p> <div class="sourceCode r"><pre>vars <- c("cyl", "am") mtcars %>% tidyr::pivot_longer(all_of(vars)) #> # A tibble: 64 x 11 #> mpg disp hp drat wt qsec vs gear carb name value #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> #> 1 21 160 110 3.9 2.62 16.5 0 4 4 cyl 6 #> 2 21 160 110 3.9 2.62 16.5 0 4 4 am 1 #> 3 21 160 110 3.9 2.88 17.0 0 4 4 cyl 6 #> 4 21 160 110 3.9 2.88 17.0 0 4 4 am 1 #> # ... with 60 more rows </pre></div> <p>If the function does not take a tidy selection, it might be possible to use a <em>bridge pattern</em>. This option is presented in the bridge section below. If a bridge is impossible or inconvenient, a little metaprogramming with the <a href="topic-metaprogramming.html">symbolise-and-inject pattern</a> can help. </p> <h3>Bridge patterns</h3> <p>Sometimes the function you are calling does not implement the behaviour you would like to give to the arguments of your function. To work around this may require a little thought since there is no systematic way of turning one behaviour into another. The general technique consists in forwarding the arguments inside a context that implements the behaviour that you want. Then, find a way to bridge the result to the target verb or function. </p> <h4><code>across()</code> as a selection to data-mask bridge</h4> <p>dplyr 1.0 added support for tidy selections in all verbs via <code>across()</code>. This function is normally used for mapping over columns but can also be used to perform a simple selection. For instance, if you'd like to pass an argument to <code>group_by()</code> with a tidy-selection interface instead of a data-masked one, use <code>across()</code> as a bridge: </p> <div class="sourceCode r"><pre>my_group_by <- function(data, var) { data %>% dplyr::group_by(across({{ var }})) } mtcars %>% my_group_by(starts_with("c")) </pre></div> <p>Since <code>across()</code> takes selections in a single argument (unlike <code>select()</code> which takes multiple arguments), you can't directly pass <code>...</code>. Instead, take them within <code>c()</code>, which is the tidyselect way of supplying multiple selections within a single argument: </p> <div class="sourceCode r"><pre>my_group_by <- function(.data, ...) { .data %>% dplyr::group_by(across(c(...)})) } mtcars %>% my_group_by(starts_with("c"), vs:gear) </pre></div> <h4><code>across(all_of())</code> as a names to data mask bridge</h4> <p>If instead of forwarding variables in <code>across()</code> you pass them to <code>all_of()</code>, you create a names to data mask bridge. </p> <div class="sourceCode r"><pre>my_group_by <- function(data, vars) { data %>% dplyr::group_by(across(all_of(vars))) } mtcars %>% my_group_by(c("cyl", "am")) </pre></div> <p>Use this bridge technique to connect vectors of names to a data-masked context. </p> <h4><code>transmute()</code> as a data-mask to selection bridge</h4> <p>Passing data-masked arguments to a tidy selection is a little more tricky and requires a three step process. </p> <div class="sourceCode r"><pre>my_pivot_longer <- function(data, ...) { # Forward `...` in data-mask context with `transmute()` # and save the inputs names inputs <- dplyr::transmute(data, ...) names <- names(inputs) # Update the data with the inputs data <- dplyr::mutate(data, !!!inputs) # Select the inputs by name with `all_of()` tidyr::pivot_longer(data, cols = all_of(names)) } mtcars %>% my_pivot_longer(cyl, am = am * 100) </pre></div> <ol> <li><p> In a first step we pass the <code>...</code> expressions to <code>transmute()</code>. Unlike <code>mutate()</code>, it creates a new data frame from the user inputs. The only goal of this step is to inspect the names in <code>...</code>, including the default names created for unnamed arguments. </p> </li> <li><p> Once we have the names, we inject the arguments into <code>mutate()</code> to update the data frame. </p> </li> <li><p> Finally, we pass the names to the tidy selection via <a href="https://tidyselect.r-lib.org/reference/all_of.html"><code>all_of()</code></a>. </p> </li></ol> <h3>Transformation patterns</h3> <h4>Named inputs versus <code>...</code></h4> <p>In the case of a named argument, transformation is easy. We simply surround the embraced input it in R code. For instance, the <code>my_summarise()</code> function is not exactly useful compared to just calling <code>summarise()</code>: </p> <div class="sourceCode r"><pre>my_summarise <- function(data, var) { data %>% dplyr::summarise({{ var }}) } </pre></div> <p>We can make it more useful by adding code around the variable: </p> <div class="sourceCode r"><pre>my_mean <- function(data, var) { data %>% dplyr::summarise(mean = mean({{ var }}, na.rm = TRUE)) } </pre></div> <p>For inputs in <code>...</code> however, this technique does not work. We would need some kind of templating syntax for dots that lets us specify R code with a placeholder for the dots elements. This isn't built in tidy eval but you can use operators like <code>dplyr::across()</code>, <code>dplyr::if_all()</code>, or <code>dplyr::if_any()</code>. When that isn't possible, you can template the expression manually. </p> <h4>Transforming inputs with <code>across()</code></h4> <p>The <code>across()</code> operation in dplyr is a convenient way of mapping an expression across a set of inputs. We will create a variant of <code>my_mean()</code> that computes the <code>mean()</code> of all arguments supplied in <code>...</code>. The easiest way it to forward the dots to <code>across()</code> (which causes <code>...</code> to inherit its tidy selection behaviour): </p> <div class="sourceCode r"><pre>my_mean <- function(data, ...) { data %>% dplyr::summarise(across(c(...), ~ mean(.x, na.rm = TRUE))) } mtcars %>% my_mean(cyl, carb) #> # A tibble: 1 x 2 #> cyl carb #> <dbl> <dbl> #> 1 6.19 2.81 mtcars %>% my_mean(foo = cyl, bar = carb) #> # A tibble: 1 x 2 #> foo bar #> <dbl> <dbl> #> 1 6.19 2.81 mtcars %>% my_mean(starts_with("c"), mpg:disp) #> # A tibble: 1 x 4 #> cyl carb mpg disp #> <dbl> <dbl> <dbl> <dbl> #> 1 6.19 2.81 20.1 231. </pre></div> <h4>Transforming inputs with <code>if_all()</code> and <code>if_any()</code></h4> <p><code>dplyr::filter()</code> requires a different operation than <code>across()</code> because it needs to combine the logical expressions with <code>&</code> or <code>|</code>. To solve this problem dplyr introduced the <code>if_all()</code> and <code>if_any()</code> variants of <code>across()</code>. </p> <p>In the following example, we filter all rows for which a set of variables are not equal to their minimum value: </p> <div class="sourceCode r"><pre>filter_non_baseline <- function(.data, ...) { .data %>% dplyr::filter(if_all(c(...), ~ .x != min(.x, na.rm = TRUE))) } mtcars %>% filter_non_baseline(vs, am, gear) </pre></div> <hr /><div style="text-align: center;">[Package <em>rlang</em> version 1.0.6 <a href="00Index.html">Index</a>]</div> </body></html>