EVOLUTION-MANAGER
Edit File: vec_locate_matches.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Locate observations matching specified conditions</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for vec_locate_matches {vctrs}"><tr><td>vec_locate_matches {vctrs}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2>Locate observations matching specified conditions</h2> <h3>Description</h3> <p><a href="https://lifecycle.r-lib.org/articles/stages.html#experimental"><img src="../help/figures/lifecycle-experimental.svg" alt='[Experimental]' /></a> </p> <p><code>vec_locate_matches()</code> is a more flexible version of <code><a href="vec_match.html">vec_match()</a></code> used to identify locations where each observation of <code>needles</code> matches one or multiple observations in <code>haystack</code>. Unlike <code>vec_match()</code>, <code>vec_locate_matches()</code> returns all matches by default, and can match on binary conditions other than equality, such as <code>></code>, <code>>=</code>, <code><</code>, and <code><=</code>. </p> <h3>Usage</h3> <pre> vec_locate_matches( needles, haystack, ..., condition = "==", filter = "none", incomplete = "compare", no_match = NA_integer_, remaining = "drop", multiple = "all", nan_distinct = FALSE, chr_proxy_collate = NULL, needles_arg = "", haystack_arg = "", error_call = current_env() ) </pre> <h3>Arguments</h3> <table summary="R argblock"> <tr valign="top"><td><code>needles, haystack</code></td> <td> <p>Vectors used for matching. </p> <ul> <li> <p><code>needles</code> represents the vector to search for. </p> </li> <li> <p><code>haystack</code> represents the vector to search in. </p> </li></ul> <p>Prior to comparison, <code>needles</code> and <code>haystack</code> are coerced to the same type.</p> </td></tr> <tr valign="top"><td><code>...</code></td> <td> <p>These dots are for future extensions and must be empty.</p> </td></tr> <tr valign="top"><td><code>condition</code></td> <td> <p>Condition controlling how <code>needles</code> should be compared against <code>haystack</code> to identify a successful match. </p> <ul> <li><p> One of: <code>"=="</code>, <code>">"</code>, <code>">="</code>, <code>"<"</code>, or <code>"<="</code>. </p> </li> <li><p> For data frames, a length <code>1</code> or <code>ncol(needles)</code> character vector containing only the above options, specifying how matching is determined for each column. </p> </li></ul> </td></tr> <tr valign="top"><td><code>filter</code></td> <td> <p>Filter to be applied to the matched results. </p> <ul> <li> <p><code>"none"</code> doesn't apply any filter. </p> </li> <li> <p><code>"min"</code> returns only the minimum haystack value matching the current needle. </p> </li> <li> <p><code>"max"</code> returns only the maximum haystack value matching the current needle. </p> </li> <li><p> For data frames, a length <code>1</code> or <code>ncol(needles)</code> character vector containing only the above options, specifying a filter to apply to each column. </p> </li></ul> <p>Filters don't have any effect on <code>"=="</code> conditions, but are useful for computing "rolling" matches with other conditions. </p> <p>A filter can return multiple haystack matches for a particular needle if the maximum or minimum haystack value is duplicated in <code>haystack</code>. These can be further controlled with <code>multiple</code>.</p> </td></tr> <tr valign="top"><td><code>incomplete</code></td> <td> <p>Handling of missing values and <a href="vec_detect_complete.html">incomplete</a> observations in <code>needles</code>. </p> <ul> <li> <p><code>"compare"</code> uses <code>condition</code> to determine whether or not a missing value in <code>needles</code> matches a missing value in <code>haystack</code>. If <code>condition</code> is <code>==</code>, <code>>=</code>, or <code><=</code>, then missing values will match. </p> </li> <li> <p><code>"match"</code> always allows missing values in <code>needles</code> to match missing values in <code>haystack</code>, regardless of the <code>condition</code>. </p> </li> <li> <p><code>"drop"</code> drops incomplete observations in <code>needles</code> from the result. </p> </li> <li> <p><code>"error"</code> throws an error if any <code>needles</code> are incomplete. </p> </li> <li><p> If a single integer is provided, this represents the value returned in the <code>haystack</code> column for observations of <code>needles</code> that are incomplete. If <code>no_match = NA</code>, setting <code>incomplete = NA</code> forces incomplete observations in <code>needles</code> to be treated like unmatched values. </p> </li></ul> <p><code>nan_distinct</code> determines whether a <code>NA</code> is allowed to match a <code>NaN</code>.</p> </td></tr> <tr valign="top"><td><code>no_match</code></td> <td> <p>Handling of <code>needles</code> without a match. </p> <ul> <li> <p><code>"drop"</code> drops <code>needles</code> with zero matches from the result. </p> </li> <li> <p><code>"error"</code> throws an error if any <code>needles</code> have zero matches. </p> </li> <li><p> If a single integer is provided, this represents the value returned in the <code>haystack</code> column for observations of <code>needles</code> that have zero matches. The default represents an unmatched needle with <code>NA</code>. </p> </li></ul> </td></tr> <tr valign="top"><td><code>remaining</code></td> <td> <p>Handling of <code>haystack</code> values that <code>needles</code> never matched. </p> <ul> <li> <p><code>"drop"</code> drops remaining <code>haystack</code> values from the result. Typically, this is the desired behavior if you only care when <code>needles</code> has a match. </p> </li> <li> <p><code>"error"</code> throws an error if there are any remaining <code>haystack</code> values. </p> </li> <li><p> If a single integer is provided (often <code>NA</code>), this represents the value returned in the <code>needles</code> column for the remaining <code>haystack</code> values that <code>needles</code> never matched. Remaining <code>haystack</code> values are always returned at the end of the result. </p> </li></ul> </td></tr> <tr valign="top"><td><code>multiple</code></td> <td> <p>Handling of <code>needles</code> with multiple matches. For each needle: </p> <ul> <li> <p><code>"all"</code> returns all matches detected in <code>haystack</code>. </p> </li> <li> <p><code>"any"</code> returns any match detected in <code>haystack</code> with no guarantees on which match will be returned. It is often faster than <code>"first"</code> and <code>"last"</code> if you just need to detect if there is at least one match. </p> </li> <li> <p><code>"first"</code> returns the first match detected in <code>haystack</code>. </p> </li> <li> <p><code>"last"</code> returns the last match detected in <code>haystack</code>. </p> </li> <li> <p><code>"warning"</code> throws a warning if multiple matches are detected, but otherwise falls back to <code>"all"</code>. </p> </li> <li> <p><code>"error"</code> throws an error if multiple matches are detected. </p> </li></ul> </td></tr> <tr valign="top"><td><code>nan_distinct</code></td> <td> <p>A single logical specifying whether or not <code>NaN</code> should be considered distinct from <code>NA</code> for double and complex vectors. If <code>TRUE</code>, <code>NaN</code> will always be ordered between <code>NA</code> and non-missing numbers.</p> </td></tr> <tr valign="top"><td><code>chr_proxy_collate</code></td> <td> <p>A function generating an alternate representation of character vectors to use for collation, often used for locale-aware ordering. </p> <ul> <li><p> If <code>NULL</code>, no transformation is done. </p> </li> <li><p> Otherwise, this must be a function of one argument. If the input contains a character vector, it will be passed to this function after it has been translated to UTF-8. This function should return a character vector with the same length as the input. The result should sort as expected in the C-locale, regardless of encoding. </p> </li></ul> <p>For data frames, <code>chr_proxy_collate</code> will be applied to all character columns. </p> <p>Common transformation functions include: <code>tolower()</code> for case-insensitive ordering and <code>stringi::stri_sort_key()</code> for locale-aware ordering.</p> </td></tr> <tr valign="top"><td><code>needles_arg, haystack_arg</code></td> <td> <p>Argument tags for <code>needles</code> and <code>haystack</code> used in error messages.</p> </td></tr> <tr valign="top"><td><code>error_call</code></td> <td> <p>The execution environment of a currently running function, e.g. <code>caller_env()</code>. The function will be mentioned in error messages as the source of the error. See the <code>call</code> argument of <code><a href="../../rlang/html/abort.html">abort()</a></code> for more information.</p> </td></tr> </table> <h3>Details</h3> <p><code><a href="vec_match.html">vec_match()</a></code> is identical to (but often slightly faster than): </p> <div class="sourceCode"><pre>vec_locate_matches( needles, haystack, condition = "==", multiple = "first", nan_distinct = TRUE ) </pre></div> <p><code>vec_locate_matches()</code> is extremely similar to a SQL join between <code>needles</code> and <code>haystack</code>, with the default being most similar to a left join. </p> <p>Be very careful when specifying match <code>condition</code>s. If a condition is mis-specified, it is very easy to accidentally generate an exponentially large number of matches. </p> <h3>Value</h3> <p>A two column data frame containing the locations of the matches. </p> <ul> <li> <p><code>needles</code> is an integer vector containing the location of the needle currently being matched. </p> </li> <li> <p><code>haystack</code> is an integer vector containing the location of the corresponding match in the haystack for the current needle. </p> </li></ul> <h3>Dependencies of <code>vec_locate_matches()</code></h3> <ul> <li> <p><code><a href="order-radix.html">vec_order_radix()</a></code> </p> </li> <li> <p><code><a href="vec_detect_complete.html">vec_detect_complete()</a></code> </p> </li></ul> <h3>Examples</h3> <pre> x <- c(1, 2, NA, 3, NaN) y <- c(2, 1, 4, NA, 1, 2, NaN) # By default, for each element of `x`, all matching locations in `y` are # returned matches <- vec_locate_matches(x, y) matches # The result can be used to slice the inputs to align them data_frame( x = vec_slice(x, matches$needles), y = vec_slice(y, matches$haystack) ) # If multiple matches are present, control which is returned with `multiple` vec_locate_matches(x, y, multiple = "first") vec_locate_matches(x, y, multiple = "last") vec_locate_matches(x, y, multiple = "any") try(vec_locate_matches(x, y, multiple = "error")) # By default, NA is treated as being identical to NaN. # Using `nan_distinct = TRUE` treats NA and NaN as different values, so NA # can only match NA, and NaN can only match NaN. vec_locate_matches(x, y, nan_distinct = TRUE) # If you never want missing values to match, set `incomplete = NA` to return # `NA` in the `haystack` column anytime there was an incomplete observation # in `needles`. vec_locate_matches(x, y, incomplete = NA) # `no_match` allows you to specify the returned value for a needle with # zero matches. Note that this is different from an incomplete value, # so specifying `no_match` allows you to differentiate between incomplete # values and unmatched values. vec_locate_matches(x, y, incomplete = NA, no_match = 0L) # If you want to require that every `needle` has at least 1 match, set # `no_match` to `"error"`: try(vec_locate_matches(x, y, incomplete = NA, no_match = "error")) # By default, `vec_locate_matches()` detects equality between `needles` and # `haystack`. Using `condition`, you can detect where an inequality holds # true instead. For example, to find every location where `x[[i]] >= y`: matches <- vec_locate_matches(x, y, condition = ">=") data_frame( x = vec_slice(x, matches$needles), y = vec_slice(y, matches$haystack) ) # You can limit which matches are returned with a `filter`. For example, # with the above example you can filter the matches returned by `x[[i]] >= y` # down to only the ones containing the maximum `y` value of those matches. matches <- vec_locate_matches(x, y, condition = ">=", filter = "max") # Here, the matches for the `3` needle value have been filtered down to # only include the maximum haystack value of those matches, `2`. This is # often referred to as a rolling join. data_frame( x = vec_slice(x, matches$needles), y = vec_slice(y, matches$haystack) ) # In the very rare case that you need to generate locations for a # cross match, where every observation of `x` is forced to match every # observation of `y` regardless of what the actual values are, you can # replace `x` and `y` with integer vectors of the same size that contain # a single value and match on those instead. x_proxy <- vec_rep(1L, vec_size(x)) y_proxy <- vec_rep(1L, vec_size(y)) nrow(vec_locate_matches(x_proxy, y_proxy)) vec_size(x) * vec_size(y) # By default, missing values will match other missing values when using # `==`, `>=`, or `<=` conditions, but not when using `>` or `<` conditions. # This is similar to how `vec_compare(x, y, na_equal = TRUE)` works. x <- c(1, NA) y <- c(NA, 2) vec_locate_matches(x, y, condition = "<=") vec_locate_matches(x, y, condition = "<") # You can force missing values to match regardless of the `condition` # by using `incomplete = "match"` vec_locate_matches(x, y, condition = "<", incomplete = "match") # You can also use data frames for `needles` and `haystack`. The # `condition` will be recycled to the number of columns in `needles`, or # you can specify varying conditions per column. In this example, we take # a vector of date `values` and find all locations where each value is # between lower and upper bounds specified by the `haystack`. values <- as.Date("2019-01-01") + 0:9 needles <- data_frame(lower = values, upper = values) set.seed(123) lower <- as.Date("2019-01-01") + sample(10, 10, replace = TRUE) upper <- lower + sample(3, 10, replace = TRUE) haystack <- data_frame(lower = lower, upper = upper) # (values >= lower) & (values <= upper) matches <- vec_locate_matches(needles, haystack, condition = c(">=", "<=")) data_frame( lower = vec_slice(lower, matches$haystack), value = vec_slice(values, matches$needle), upper = vec_slice(upper, matches$haystack) ) </pre> <hr /><div style="text-align: center;">[Package <em>vctrs</em> version 0.5.0 <a href="00Index.html">Index</a>]</div> </body></html>