EVOLUTION-MANAGER
Edit File: translation-function.html
<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Function translation</title> <script>// Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); </script> <style type="text/css">code{white-space: pre;}</style> <style type="text/css" data-origin="pandoc"> pre > code.sourceCode { white-space: pre; position: relative; } pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } code.sourceCode > span { color: inherit; text-decoration: inherit; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { pre > code.sourceCode { white-space: pre-wrap; } pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } for (var j = 0; j < rules.length; j++) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") continue; var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' && rule.style.backgroundColor === '') continue; // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> <style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">Function translation</h1> <p>There are two parts to dbplyr SQL translation: translating dplyr verbs, and translating expressions within those verbs. This vignette describes how individual expressions (function calls) are translated; <code>vignette("translate-verb")</code> describes how entire verbs are translated.</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">library</span>(dbplyr)</span> <span id="cb1-2"><a href="#cb1-2"></a><span class="kw">library</span>(dplyr)</span></code></pre></div> <p><code>dbplyr::translate_sql()</code> powers translation of individual function calls, and I’ll use it extensively in this vignette to show what’s happening. You shouldn’t need to use it ordinary code as dbplyr takes care of the translation automatically.</p> <div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">translate_sql</span>((x <span class="op">+</span><span class="st"> </span>y) <span class="op">/</span><span class="st"> </span><span class="dv">2</span>)</span> <span id="cb2-2"><a href="#cb2-2"></a><span class="co">#> <SQL> (`x` + `y`) / 2.0</span></span></code></pre></div> <p><code>translate_sql()</code> takes an optional <code>con</code> parameter. If not supplied, this causes dplyr to generate (approximately) SQL-92 compliant SQL. If supplied, dplyr uses <code>sql_translate_env()</code> to look up a custom environment which makes it possible for different databases to generate slightly different SQL: see <code>vignette("new-backend")</code> for more details. You can use the various simulate helpers to see the translations used by different backends:</p> <div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1"></a><span class="kw">translate_sql</span>(x <span class="op">^</span><span class="st"> </span>2L)</span> <span id="cb3-2"><a href="#cb3-2"></a><span class="co">#> <SQL> POWER(`x`, 2)</span></span> <span id="cb3-3"><a href="#cb3-3"></a><span class="kw">translate_sql</span>(x <span class="op">^</span><span class="st"> </span>2L, <span class="dt">con =</span> <span class="kw">simulate_sqlite</span>())</span> <span id="cb3-4"><a href="#cb3-4"></a><span class="co">#> <SQL> POWER(`x`, 2)</span></span> <span id="cb3-5"><a href="#cb3-5"></a><span class="kw">translate_sql</span>(x <span class="op">^</span><span class="st"> </span>2L, <span class="dt">con =</span> <span class="kw">simulate_access</span>())</span> <span id="cb3-6"><a href="#cb3-6"></a><span class="co">#> <SQL> `x` ^ 2</span></span></code></pre></div> <p>Perfect translation is not possible because databases don’t have all the functions that R does. The goal of dplyr is to provide a semantic rather than a literal translation: what you mean, rather than precisely what is done. In fact, even for functions that exist both in databases and R, you shouldn’t expect results to be identical; database programmers have different priorities than R core programmers. For example, in R in order to get a higher level of numerical accuracy, <code>mean()</code> loops through the data twice. R’s <code>mean()</code> also provides a <code>trim</code> option for computing trimmed means; this is something that databases do not provide.</p> <p>If you’re interested in how <code>translate_sql()</code> is implemented, the basic techniques that underlie the implementation of <code>translate_sql()</code> are described in <a href="http://adv-r.hadley.nz/translation.html">“Advanced R”</a>.</p> <div id="basic-differences" class="section level2"> <h2>Basic differences</h2> <p>The following examples work through some of the basic differences between R and SQL.</p> <ul> <li><p><code>"</code> and <code>'</code> mean different things</p> <div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1"></a><span class="co"># In SQLite variable names are escaped by double quotes:</span></span> <span id="cb4-2"><a href="#cb4-2"></a><span class="kw">translate_sql</span>(x)</span> <span id="cb4-3"><a href="#cb4-3"></a><span class="co">#> <SQL> `x`</span></span> <span id="cb4-4"><a href="#cb4-4"></a><span class="co"># And strings are escaped by single quotes</span></span> <span id="cb4-5"><a href="#cb4-5"></a><span class="kw">translate_sql</span>(<span class="st">"x"</span>)</span> <span id="cb4-6"><a href="#cb4-6"></a><span class="co">#> <SQL> 'x'</span></span></code></pre></div></li> <li><p>And some functions have different argument orders:</p> <div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1"></a><span class="kw">translate_sql</span>(<span class="kw">substr</span>(x, <span class="dv">5</span>, <span class="dv">10</span>))</span> <span id="cb5-2"><a href="#cb5-2"></a><span class="co">#> <SQL> SUBSTR(`x`, 5, 6)</span></span> <span id="cb5-3"><a href="#cb5-3"></a><span class="kw">translate_sql</span>(<span class="kw">log</span>(x, <span class="dv">10</span>))</span> <span id="cb5-4"><a href="#cb5-4"></a><span class="co">#> <SQL> LOG(10.0, `x`)</span></span></code></pre></div></li> <li><p>R and SQL have different defaults for integers and reals. In R, 1 is a real, and 1L is an integer. In SQL, 1 is an integer, and 1.0 is a real</p> <div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1"></a><span class="kw">translate_sql</span>(<span class="dv">1</span>)</span> <span id="cb6-2"><a href="#cb6-2"></a><span class="co">#> <SQL> 1.0</span></span> <span id="cb6-3"><a href="#cb6-3"></a><span class="kw">translate_sql</span>(1L)</span> <span id="cb6-4"><a href="#cb6-4"></a><span class="co">#> <SQL> 1</span></span></code></pre></div></li> </ul> </div> <div id="known-functions" class="section level2"> <h2>Known functions</h2> <div id="mathematics" class="section level3"> <h3>Mathematics</h3> <ul> <li>basic math operators: <code>+</code>, <code>-</code>, <code>*</code>, <code>/</code>, <code>^</code></li> <li>trigonometry: <code>acos()</code>, <code>asin()</code>, <code>atan()</code>, <code>atan2()</code>, <code>cos()</code>, <code>cot()</code>, <code>tan()</code>, <code>sin()</code></li> <li>hypergeometric: <code>cosh()</code>, <code>coth()</code>, <code>sinh()</code>, <code>tanh()</code></li> <li>logarithmic: <code>log()</code>, <code>log10()</code>, <code>exp()</code></li> <li>misc: <code>abs()</code>, <code>ceiling()</code>, <code>sqrt()</code>, <code>sign()</code>, <code>round()</code></li> </ul> </div> </div> <div id="modulo-arithmetic" class="section level2"> <h2>Modulo arithmetic</h2> <p>dbplyr translates <code>%%</code> and <code>%/%</code> to their SQL equivalents but note that they are not precisely the same: most databases use truncated division where the modulo operator takes the sign of the dividend, where R using the mathematically preferred floored division with the modulo sign taking the sign of the divisor.</p> <div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1"></a>df <-<span class="st"> </span><span class="kw">tibble</span>(</span> <span id="cb7-2"><a href="#cb7-2"></a> <span class="dt">x =</span> <span class="kw">c</span>(10L, 10L, <span class="op">-</span>10L, <span class="op">-</span>10L), </span> <span id="cb7-3"><a href="#cb7-3"></a> <span class="dt">y =</span> <span class="kw">c</span>(3L, <span class="op">-</span>3L, 3L, <span class="op">-</span>3L)</span> <span id="cb7-4"><a href="#cb7-4"></a>)</span> <span id="cb7-5"><a href="#cb7-5"></a>mf <-<span class="st"> </span><span class="kw">tbl_memdb</span>(df)</span> <span id="cb7-6"><a href="#cb7-6"></a></span> <span id="cb7-7"><a href="#cb7-7"></a>df <span class="op">%>%</span><span class="st"> </span><span class="kw">mutate</span>(x <span class="op">%%</span><span class="st"> </span>y, x <span class="op">%/%</span><span class="st"> </span>y)</span> <span id="cb7-8"><a href="#cb7-8"></a><span class="co">#> # A tibble: 4 x 4</span></span> <span id="cb7-9"><a href="#cb7-9"></a><span class="co">#> x y `x%%y` `x%/%y`</span></span> <span id="cb7-10"><a href="#cb7-10"></a><span class="co">#> <int> <int> <int> <int></span></span> <span id="cb7-11"><a href="#cb7-11"></a><span class="co">#> 1 10 3 1 3</span></span> <span id="cb7-12"><a href="#cb7-12"></a><span class="co">#> 2 10 -3 -2 -4</span></span> <span id="cb7-13"><a href="#cb7-13"></a><span class="co">#> 3 -10 3 2 -4</span></span> <span id="cb7-14"><a href="#cb7-14"></a><span class="co">#> 4 -10 -3 -1 3</span></span> <span id="cb7-15"><a href="#cb7-15"></a></span> <span id="cb7-16"><a href="#cb7-16"></a>mf <span class="op">%>%</span><span class="st"> </span><span class="kw">mutate</span>(x <span class="op">%%</span><span class="st"> </span>y, x <span class="op">%/%</span><span class="st"> </span>y)</span> <span id="cb7-17"><a href="#cb7-17"></a><span class="co">#> # Source: lazy query [?? x 4]</span></span> <span id="cb7-18"><a href="#cb7-18"></a><span class="co">#> # Database: sqlite 3.30.1 [:memory:]</span></span> <span id="cb7-19"><a href="#cb7-19"></a><span class="co">#> x y `x%%y` `x%/%y`</span></span> <span id="cb7-20"><a href="#cb7-20"></a><span class="co">#> <int> <int> <int> <int></span></span> <span id="cb7-21"><a href="#cb7-21"></a><span class="co">#> 1 10 3 1 3</span></span> <span id="cb7-22"><a href="#cb7-22"></a><span class="co">#> 2 10 -3 1 -3</span></span> <span id="cb7-23"><a href="#cb7-23"></a><span class="co">#> 3 -10 3 -1 -3</span></span> <span id="cb7-24"><a href="#cb7-24"></a><span class="co">#> 4 -10 -3 -1 3</span></span></code></pre></div> <div id="logical-comparisons" class="section level3"> <h3>Logical comparisons</h3> <ul> <li>logical comparisons: <code><</code>, <code><=</code>, <code>!=</code>, <code>>=</code>, <code>></code>, <code>==</code>, <code>%in%</code></li> <li>boolean operations: <code>&</code>, <code>&&</code>, <code>|</code>, <code>||</code>, <code>!</code>, <code>xor()</code></li> </ul> </div> <div id="aggregation" class="section level3"> <h3>Aggregation</h3> <p>All database provide translation for the basic aggregations: <code>mean()</code>, <code>sum()</code>, <code>min()</code>, <code>max()</code>, <code>sd()</code>, <code>var()</code>. Databases automatically drop NULLs (their equivalent of missing values) whereas in R you have to ask nicely. The aggregation functions warn you about this important difference:</p> <div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1"></a><span class="kw">translate_sql</span>(<span class="kw">mean</span>(x))</span> <span id="cb8-2"><a href="#cb8-2"></a><span class="co">#> Warning: Missing values are always removed in SQL.</span></span> <span id="cb8-3"><a href="#cb8-3"></a><span class="co">#> Use `AVG(x, na.rm = TRUE)` to silence this warning</span></span> <span id="cb8-4"><a href="#cb8-4"></a><span class="co">#> This warning is displayed only once per session.</span></span> <span id="cb8-5"><a href="#cb8-5"></a><span class="co">#> <SQL> AVG(`x`) OVER ()</span></span> <span id="cb8-6"><a href="#cb8-6"></a><span class="kw">translate_sql</span>(<span class="kw">mean</span>(x, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))</span> <span id="cb8-7"><a href="#cb8-7"></a><span class="co">#> <SQL> AVG(`x`) OVER ()</span></span></code></pre></div> <p>Note that, by default, <code>translate()</code> assumes that the call is inside a <code>mutate()</code> or <code>filter()</code> and generates a window translation. If you want to see the equivalent <code>summarise()</code>/aggregation translation, use <code>window = FALSE</code>:</p> <div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1"></a><span class="kw">translate_sql</span>(<span class="kw">mean</span>(x, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>), <span class="dt">window =</span> <span class="ot">FALSE</span>)</span> <span id="cb9-2"><a href="#cb9-2"></a><span class="co">#> <SQL> AVG(`x`)</span></span></code></pre></div> </div> <div id="conditional-evaluation" class="section level3"> <h3>Conditional evaluation</h3> <p><code>if</code> and <code>switch()</code> are translate to <code>CASE WHEN</code>:</p> <div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1"></a><span class="kw">translate_sql</span>(<span class="cf">if</span> (x <span class="op">></span><span class="st"> </span><span class="dv">5</span>) <span class="st">"big"</span> <span class="cf">else</span> <span class="st">"small"</span>)</span> <span id="cb10-2"><a href="#cb10-2"></a><span class="co">#> <SQL> CASE WHEN (`x` > 5.0) THEN ('big') WHEN NOT(`x` > 5.0) THEN ('small') </span><span class="re">END</span></span> <span id="cb10-3"><a href="#cb10-3"></a><span class="kw">translate_sql</span>(<span class="cf">switch</span>(x, <span class="dt">a =</span> 1L, <span class="dt">b =</span> 2L, 3L))</span> <span id="cb10-4"><a href="#cb10-4"></a><span class="co">#> <SQL> CASE `x` WHEN ('a') THEN (1) WHEN ('b') THEN (2) ELSE (3) </span><span class="re">END</span></span></code></pre></div> </div> <div id="string-manipulation" class="section level3"> <h3>String manipulation</h3> </div> <div id="datetime" class="section level3"> <h3>Date/time</h3> <ul> <li>string functions: <code>tolower</code>, <code>toupper</code>, <code>trimws</code>, <code>nchar</code>, <code>substr</code></li> <li>coerce types: <code>as.numeric</code>, <code>as.integer</code>, <code>as.character</code></li> </ul> </div> </div> <div id="unknown-functions" class="section level2"> <h2>Unknown functions</h2> <p>Any function that dplyr doesn’t know how to convert is left as is. This means that database functions that are not covered by dplyr can be used directly via <code>translate_sql()</code>. Here a couple of examples that will work with <a href="http://www.sqlite.org/lang_corefunc.html">SQLite</a>:</p> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1"></a><span class="kw">translate_sql</span>(<span class="kw">glob</span>(x, y))</span> <span id="cb11-2"><a href="#cb11-2"></a><span class="co">#> <SQL> glob(`x`, `y`)</span></span> <span id="cb11-3"><a href="#cb11-3"></a><span class="kw">translate_sql</span>(x <span class="op">%like%</span><span class="st"> "ab%"</span>)</span> <span id="cb11-4"><a href="#cb11-4"></a><span class="co">#> <SQL> `x` like 'ab%'</span></span></code></pre></div> <p>See <code>vignette("sql")</code> for more details.</p> </div> <div id="window-functions" class="section level2"> <h2>Window functions</h2> <p>Things get a little trickier with window functions, because SQL’s window functions are considerably more expressive than the specific variants provided by base R or dplyr. They have the form <code>[expression] OVER ([partition clause] [order clause] [frame_clause])</code>:</p> <ul> <li><p>The <strong>expression</strong> is a combination of variable names and window functions. Support for window functions varies from database to database, but most support the ranking functions, <code>lead</code>, <code>lag</code>, <code>nth</code>, <code>first</code>, <code>last</code>, <code>count</code>, <code>min</code>, <code>max</code>, <code>sum</code>, <code>avg</code> and <code>stddev</code>.</p></li> <li><p>The <strong>partition clause</strong> specifies how the window function is broken down over groups. It plays an analogous role to <code>GROUP BY</code> for aggregate functions, and <code>group_by()</code> in dplyr. It is possible for different window functions to be partitioned into different groups, but not all databases support it, and neither does dplyr.</p></li> <li><p>The <strong>order clause</strong> controls the ordering (when it makes a difference). This is important for the ranking functions since it specifies which variables to rank by, but it’s also needed for cumulative functions and lead. Whenever you’re thinking about before and after in SQL, you must always tell it which variable defines the order. If the order clause is missing when needed, some databases fail with an error message while others return non-deterministic results.</p></li> <li><p>The <strong>frame clause</strong> defines which rows, or <strong>frame</strong>, that are passed to the window function, describing which rows (relative to the current row) should be included. The frame clause provides two offsets which determine the start and end of frame. There are three special values: -Inf means to include all preceding rows (in SQL, “unbounded preceding”), 0 means the current row (“current row”), and Inf means all following rows (“unbounded following”). The complete set of options is comprehensive, but fairly confusing, and is summarised visually below.</p> <p><img src="" width="100%" /></p> <p>Of the many possible specifications, there are only three that commonly used. They select between aggregation variants:</p> <ul> <li><p>Recycled: <code>BETWEEN UNBOUND PRECEEDING AND UNBOUND FOLLOWING</code></p></li> <li><p>Cumulative: <code>BETWEEN UNBOUND PRECEEDING AND CURRENT ROW</code></p></li> <li><p>Rolling: <code>BETWEEN 2 PRECEEDING AND 2 FOLLOWING</code></p></li> </ul> <p>dplyr generates the frame clause based on whether your using a recycled aggregate or a cumulative aggregate.</p></li> </ul> <p>To see how individual window functions are translated to SQL, we can again use <code>translate_sql()</code>:</p> <div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1"></a><span class="kw">translate_sql</span>(<span class="kw">mean</span>(G))</span> <span id="cb12-2"><a href="#cb12-2"></a><span class="co">#> <SQL> AVG(`G`) OVER ()</span></span> <span id="cb12-3"><a href="#cb12-3"></a><span class="kw">translate_sql</span>(<span class="kw">rank</span>(G))</span> <span id="cb12-4"><a href="#cb12-4"></a><span class="co">#> <SQL> RANK() OVER (ORDER BY `G`)</span></span> <span id="cb12-5"><a href="#cb12-5"></a><span class="kw">translate_sql</span>(<span class="kw">ntile</span>(G, <span class="dv">2</span>))</span> <span id="cb12-6"><a href="#cb12-6"></a><span class="co">#> <SQL> NTILE(2) OVER (ORDER BY `G`)</span></span> <span id="cb12-7"><a href="#cb12-7"></a><span class="kw">translate_sql</span>(<span class="kw">lag</span>(G))</span> <span id="cb12-8"><a href="#cb12-8"></a><span class="co">#> <SQL> LAG(`G`, 1, NULL) OVER ()</span></span></code></pre></div> <p>If the tbl has been grouped or arranged previously in the pipeline, then dplyr will use that information to set the “partition by” and “order by” clauses. For interactive exploration, you can achieve the same effect by setting the <code>vars_group</code> and <code>vars_order</code> arguments to <code>translate_sql()</code></p> <div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1"></a><span class="kw">translate_sql</span>(<span class="kw">cummean</span>(G), <span class="dt">vars_order =</span> <span class="st">"year"</span>)</span> <span id="cb13-2"><a href="#cb13-2"></a><span class="co">#> <SQL> AVG(`G`) OVER (ORDER BY `year` ROWS UNBOUNDED PRECEDING)</span></span> <span id="cb13-3"><a href="#cb13-3"></a><span class="kw">translate_sql</span>(<span class="kw">rank</span>(), <span class="dt">vars_group =</span> <span class="st">"ID"</span>)</span> <span id="cb13-4"><a href="#cb13-4"></a><span class="co">#> <SQL> RANK() OVER (PARTITION BY `ID`)</span></span></code></pre></div> <p>There are some challenges when translating window functions between R and SQL, because dplyr tries to keep the window functions as similar as possible to both the existing R analogues and to the SQL functions. This means that there are three ways to control the order clause depending on which window function you’re using:</p> <ul> <li><p>For ranking functions, the ordering variable is the first argument: <code>rank(x)</code>, <code>ntile(y, 2)</code>. If omitted or <code>NULL</code>, will use the default ordering associated with the tbl (as set by <code>arrange()</code>).</p></li> <li><p>Accumulating aggregates only take a single argument (the vector to aggregate). To control ordering, use <code>order_by()</code>.</p></li> <li><p>Aggregates implemented in dplyr (<code>lead</code>, <code>lag</code>, <code>nth_value</code>, <code>first_value</code>, <code>last_value</code>) have an <code>order_by</code> argument. Supply it to override the default ordering.</p></li> </ul> <p>The three options are illustrated in the snippet below:</p> <div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1"></a><span class="kw">mutate</span>(players,</span> <span id="cb14-2"><a href="#cb14-2"></a> <span class="kw">min_rank</span>(yearID),</span> <span id="cb14-3"><a href="#cb14-3"></a> <span class="kw">order_by</span>(yearID, <span class="kw">cumsum</span>(G)),</span> <span id="cb14-4"><a href="#cb14-4"></a> <span class="kw">lead</span>(G, <span class="dt">order_by =</span> yearID)</span> <span id="cb14-5"><a href="#cb14-5"></a>)</span></code></pre></div> <p>Currently there is no way to order by multiple variables, except by setting the default ordering with <code>arrange()</code>. This will be added in a future release.</p> </div> <!-- code folding --> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>