<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <meta name="date" content="2022-10-09" /> <title>Reference semantics</title> <script>// Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); </script> <script>// Hide empty <a> tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. 
document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); </script> <style type="text/css"> code{white-space: pre-wrap;} span.smallcaps{font-variant: small-caps;} span.underline{text-decoration: underline;} div.column{display: inline-block; vertical-align: top; width: 50%;} div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} ul.task-list{list-style: none;} </style> <style type="text/css"> code { white-space: pre; } .sourceCode { overflow: visible; } </style> <style type="text/css" data-origin="pandoc"> pre > code.sourceCode { white-space: pre; position: relative; } pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } code.sourceCode > span { color: inherit; text-decoration: inherit; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { pre > code.sourceCode { white-space: pre-wrap; } pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media 
screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } 
catch (e) { continue; } var j = 0; while (j < rules.length) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") { j++; continue; } var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' && rule.style.backgroundColor === '') { j++; continue; } // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> <style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { 
background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">Reference semantics</h1> <h4 class="date">2022-10-09</h4> <p>This vignette discusses <em>data.table</em>’s reference semantics, which allow you to <em>add/update/delete</em> columns of a <em>data.table by reference</em>, and to combine these operations with <code>i</code> and <code>by</code>. 
It is aimed at those who are already familiar with <em>data.table</em> syntax, its general form, how to subset rows in <code>i</code>, select and compute on columns, and perform aggregations by group. If you’re not familiar with these concepts, please read the <em>“Introduction to data.table”</em> vignette first.</p> <hr /> <div id="data" class="section level2"> <h2>Data</h2> <p>We will use the same <code>flights</code> data as in the <em>“Introduction to data.table”</em> vignette.</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true"></a>flights <-<span class="st"> </span><span class="kw">fread</span>(<span class="st">"flights14.csv"</span>)</span> <span id="cb1-2"><a href="#cb1-2" aria-hidden="true"></a>flights</span> <span id="cb1-3"><a href="#cb1-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb1-4"><a href="#cb1-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb1-5"><a href="#cb1-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb1-6"><a href="#cb1-6" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb1-7"><a href="#cb1-7" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span></span> <span id="cb1-8"><a href="#cb1-8" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb1-9"><a href="#cb1-9" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb1-10"><a href="#cb1-10" aria-hidden="true"></a><span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14</span></span> <span id="cb1-11"><a href="#cb1-11" aria-hidden="true"></a><span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 
8</span></span> <span id="cb1-12"><a href="#cb1-12" aria-hidden="true"></a><span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11</span></span> <span id="cb1-13"><a href="#cb1-13" aria-hidden="true"></a><span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11</span></span> <span id="cb1-14"><a href="#cb1-14" aria-hidden="true"></a><span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8</span></span> <span id="cb1-15"><a href="#cb1-15" aria-hidden="true"></a><span class="kw">dim</span>(flights)</span> <span id="cb1-16"><a href="#cb1-16" aria-hidden="true"></a><span class="co"># [1] 253316 11</span></span></code></pre></div> </div> <div id="introduction" class="section level2"> <h2>Introduction</h2> <p>In this vignette, we will</p> <ol style="list-style-type: decimal"> <li><p>first discuss reference semantics briefly and look at the two different forms in which the <code>:=</code> operator can be used</p></li> <li><p>then see how we can <em>add/update/delete</em> columns <em>by reference</em> in <code>j</code> using the <code>:=</code> operator and how to combine with <code>i</code> and <code>by</code>.</p></li> <li><p>and finally we will look at using <code>:=</code> for its <em>side-effect</em> and how we can avoid the side effects using <code>copy()</code>.</p></li> </ol> </div> <div id="reference-semantics" class="section level2"> <h2>1. Reference semantics</h2> <p>All the operations we have seen so far in the previous vignette resulted in a new data set. 
We will see how to <em>add</em> new column(s), <em>update</em> or <em>delete</em> existing column(s) on the original data.</p> <div id="a-background" class="section level3"> <h3>a) Background</h3> <p>Before we look at <em>reference semantics</em>, consider the <em>data.frame</em> shown below:</p> <div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true"></a>DF =<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">ID =</span> <span class="kw">c</span>(<span class="st">"b"</span>,<span class="st">"b"</span>,<span class="st">"b"</span>,<span class="st">"a"</span>,<span class="st">"a"</span>,<span class="st">"c"</span>), <span class="dt">a =</span> <span class="dv">1</span><span class="op">:</span><span class="dv">6</span>, <span class="dt">b =</span> <span class="dv">7</span><span class="op">:</span><span class="dv">12</span>, <span class="dt">c =</span> <span class="dv">13</span><span class="op">:</span><span class="dv">18</span>)</span> <span id="cb2-2"><a href="#cb2-2" aria-hidden="true"></a>DF</span> <span id="cb2-3"><a href="#cb2-3" aria-hidden="true"></a><span class="co"># ID a b c</span></span> <span id="cb2-4"><a href="#cb2-4" aria-hidden="true"></a><span class="co"># 1 b 1 7 13</span></span> <span id="cb2-5"><a href="#cb2-5" aria-hidden="true"></a><span class="co"># 2 b 2 8 14</span></span> <span id="cb2-6"><a href="#cb2-6" aria-hidden="true"></a><span class="co"># 3 b 3 9 15</span></span> <span id="cb2-7"><a href="#cb2-7" aria-hidden="true"></a><span class="co"># 4 a 4 10 16</span></span> <span id="cb2-8"><a href="#cb2-8" aria-hidden="true"></a><span class="co"># 5 a 5 11 17</span></span> <span id="cb2-9"><a href="#cb2-9" aria-hidden="true"></a><span class="co"># 6 c 6 12 18</span></span></code></pre></div> <p>When we did:</p> <div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1" 
aria-hidden="true"></a>DF<span class="op">$</span>c <-<span class="st"> </span><span class="dv">18</span><span class="op">:</span><span class="dv">13</span> <span class="co"># (1) -- replace entire column</span></span> <span id="cb3-2"><a href="#cb3-2" aria-hidden="true"></a><span class="co"># or</span></span> <span id="cb3-3"><a href="#cb3-3" aria-hidden="true"></a>DF<span class="op">$</span>c[DF<span class="op">$</span>ID <span class="op">==</span><span class="st"> "b"</span>] <-<span class="st"> </span><span class="dv">15</span><span class="op">:</span><span class="dv">13</span> <span class="co"># (2) -- subassign in column 'c'</span></span></code></pre></div> <p>both (1) and (2) resulted in a deep copy of the entire data.frame in <code>R</code> versions <code>< 3.1</code>. <a href="https://stackoverflow.com/q/23898969/559784">It copied more than once</a>. To improve performance by avoiding these redundant copies, <em>data.table</em> utilised the <a href="https://stackoverflow.com/q/7033106/559784">available but unused <code>:=</code> operator in R</a>.</p> <p>Great performance improvements were made in <code>R v3.1</code>, as a result of which only a <em>shallow</em> copy, rather than a <em>deep</em> copy, is made for (1). For (2), however, the entire column is still <em>deep</em> copied even in <code>R v3.1+</code>. This means the more columns one subassigns to in the <em>same query</em>, the more <em>deep</em> copies R makes.</p> <div id="shallow-vs-deep-copy" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"><em>shallow</em> vs <em>deep</em> copy</h4> <p>A <em>shallow</em> copy is just a copy of the vector of column pointers (corresponding to the columns in a <em>data.frame</em> or <em>data.table</em>). 
The actual data is not physically copied in memory.</p> <p>A <em>deep</em> copy, on the other hand, copies the entire data to another location in memory.</p> </div> </div> </div> <div id="section" class="section level1"> <h1></h1> <p>With <em>data.table’s</em> <code>:=</code> operator, absolutely no copies are made in <em>either</em> (1) or (2), irrespective of the R version you are using. This is because the <code>:=</code> operator updates <em>data.table</em> columns <em>in-place</em> (by reference).</p> <div id="b-the-operator" class="section level3"> <h3>b) The <code>:=</code> operator</h3> <p>It can be used in <code>j</code> in two ways:</p> <ol style="list-style-type: lower-alpha"> <li><p>The <code>LHS := RHS</code> form</p> <div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true"></a>DT[, <span class="kw">c</span>(<span class="st">"colA"</span>, <span class="st">"colB"</span>, ...) <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="kw">list</span>(valA, valB, ...)]</span> <span id="cb4-2"><a href="#cb4-2" aria-hidden="true"></a></span> <span id="cb4-3"><a href="#cb4-3" aria-hidden="true"></a><span class="co"># when you have only one column to assign to you</span></span> <span id="cb4-4"><a href="#cb4-4" aria-hidden="true"></a><span class="co"># can drop the quotes and list(), for convenience</span></span> <span id="cb4-5"><a href="#cb4-5" aria-hidden="true"></a>DT[, colA <span class="op">:</span><span class="er">=</span><span class="st"> </span>valA]</span></code></pre></div></li> <li><p>The functional form</p> <div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true"></a>DT[, <span class="st">`</span><span class="dt">:=</span><span class="st">`</span>(<span class="dt">colA =</span> valA, <span class="co"># valA is assigned to colA</span></span> <span id="cb5-2"><a 
href="#cb5-2" aria-hidden="true"></a> <span class="dt">colB =</span> valB, <span class="co"># valB is assigned to colB</span></span> <span id="cb5-3"><a href="#cb5-3" aria-hidden="true"></a> ...</span> <span id="cb5-4"><a href="#cb5-4" aria-hidden="true"></a>)]</span></code></pre></div></li> </ol> <div id="section-1" class="section level4 bs-callout bs-callout-warning"> <h4 class="bs-callout bs-callout-warning"></h4> <p>Note that the code above only illustrates how <code>:=</code> can be used; these are not working examples. We will start using them on the <code>flights</code> <em>data.table</em> in the next section.</p> </div> </div> </div> <div id="section-2" class="section level1"> <h1></h1> <div id="section-3" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>In (a), <code>LHS</code> takes a character vector of column names and <code>RHS</code> a <em>list of values</em>. <code>RHS</code> just needs to be a <code>list</code>, irrespective of how it is generated (e.g., using <code>lapply()</code>, <code>list()</code>, <code>mget()</code>, <code>mapply()</code> etc.). This form is usually easy to program with and is particularly useful when you don’t know the columns to assign values to in advance.</p></li> <li><p>On the other hand, (b) is handy if you would like to jot some comments down for later.</p></li> <li><p>The result is returned <em>invisibly</em>.</p></li> <li><p>Since <code>:=</code> is available in <code>j</code>, we can combine it with <code>i</code> and <code>by</code> operations just like the aggregation operations we saw in the previous vignette.</p></li> </ul> </div> </div> <div id="section-4" class="section level1"> <h1></h1> <p>In the two forms of <code>:=</code> shown above, note that we don’t assign the result back to a variable, because we don’t need to. The input <em>data.table</em> is modified by reference. 
Let’s go through examples to understand what we mean by this.</p> <p>For the rest of the vignette, we will work with the <code>flights</code> <em>data.table</em>.</p> <div id="addupdatedelete-columns-by-reference" class="section level2"> <h2>2. Add/update/delete columns <em>by reference</em></h2> <div id="ref-j" class="section level3"> <h3>a) Add columns by reference</h3> <div id="how-can-we-add-columns-speed-and-total-delay-of-each-flight-to-flights-data.table" class="section level4"> <h4>– How can we add columns <em>speed</em> and <em>total delay</em> of each flight to the <code>flights</code> <em>data.table</em>?</h4> <div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true"></a>flights[, <span class="st">`</span><span class="dt">:=</span><span class="st">`</span>(<span class="dt">speed =</span> distance <span class="op">/</span><span class="st"> </span>(air_time<span class="op">/</span><span class="dv">60</span>), <span class="co"># speed in mph (mi/h)</span></span> <span id="cb6-2"><a href="#cb6-2" aria-hidden="true"></a> <span class="dt">delay =</span> arr_delay <span class="op">+</span><span class="st"> </span>dep_delay)] <span class="co"># delay in minutes</span></span> <span id="cb6-3"><a href="#cb6-3" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb6-4"><a href="#cb6-4" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed delay</span></span> <span id="cb6-5"><a href="#cb6-5" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 27</span></span> <span id="cb6-6"><a href="#cb6-6" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 10</span></span> <span id="cb6-7"><a href="#cb6-7" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 11</span></span> <span id="cb6-8"><a 
href="#cb6-8" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 -34</span></span> <span id="cb6-9"><a href="#cb6-9" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 3</span></span> <span id="cb6-10"><a href="#cb6-10" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 4</span></span> <span id="cb6-11"><a href="#cb6-11" aria-hidden="true"></a></span> <span id="cb6-12"><a href="#cb6-12" aria-hidden="true"></a><span class="co">## alternatively, using the 'LHS := RHS' form</span></span> <span id="cb6-13"><a href="#cb6-13" aria-hidden="true"></a><span class="co"># flights[, c("speed", "delay") := list(distance/(air_time/60), arr_delay + dep_delay)]</span></span></code></pre></div> </div> <div id="note-that" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">Note that</h4> <ul> <li><p>We did not have to assign the result back to <code>flights</code>.</p></li> <li><p>The <code>flights</code> <em>data.table</em> now contains the two newly added columns. This is what we mean by <em>added by reference</em>.</p></li> <li><p>We used the functional form so that we could add comments on the side to explain what the computation does. 
The equivalent <code>LHS := RHS</code> form is shown commented out.</p></li> </ul> </div> </div> <div id="ref-i-j" class="section level3"> <h3>b) Update some rows of columns by reference - <em>sub-assign</em> by reference</h3> <p>Let’s take a look at all the <code>hours</code> available in the <code>flights</code> <em>data.table</em>:</p> <div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true"></a><span class="co"># get all 'hours' in flights</span></span> <span id="cb7-2"><a href="#cb7-2" aria-hidden="true"></a>flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]</span> <span id="cb7-3"><a href="#cb7-3" aria-hidden="true"></a><span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24</span></span></code></pre></div> <p>We see that there are <code>25</code> unique values in total. Both <em>0</em> and <em>24</em> hours seem to be present. Let’s go ahead and replace <em>24</em> with <em>0</em>.</p> <div id="replace-those-rows-where-hour-24-with-the-value-0" class="section level4"> <h4>– Replace those rows where <code>hour == 24</code> with the value <code>0</code></h4> <div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true"></a><span class="co"># subassign by reference</span></span> <span id="cb8-2"><a href="#cb8-2" aria-hidden="true"></a>flights[hour <span class="op">==</span><span class="st"> </span>24L, hour <span class="op">:</span><span class="er">=</span><span class="st"> </span>0L]</span></code></pre></div> </div> <div id="section-5" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>We can use <code>i</code> along with <code>:=</code> in <code>j</code> the very same way as we have already seen in the <em>“Introduction to data.table”</em> vignette.</p></li> <li><p>Column 
<code>hour</code> is replaced with <code>0</code> only on those <em>row indices</em> where the condition <code>hour == 24L</code> specified in <code>i</code> evaluates to <code>TRUE</code>.</p></li> <li><p><code>:=</code> returns the result invisibly. Sometimes it might be necessary to see the result after the assignment. We can accomplish that by adding an empty <code>[]</code> at the end of the query as shown below:</p> <div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true"></a>flights[hour <span class="op">==</span><span class="st"> </span>24L, hour <span class="op">:</span><span class="er">=</span><span class="st"> </span>0L][]</span> <span id="cb9-2"><a href="#cb9-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span></span> <span id="cb9-3"><a href="#cb9-3" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span></span> <span id="cb9-4"><a href="#cb9-4" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span></span> <span id="cb9-5"><a href="#cb9-5" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span></span> <span id="cb9-6"><a href="#cb9-6" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span></span> <span id="cb9-7"><a href="#cb9-7" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span></span> <span id="cb9-8"><a href="#cb9-8" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb9-9"><a href="#cb9-9" aria-hidden="true"></a><span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 422.6866</span></span> <span id="cb9-10"><a href="#cb9-10" aria-hidden="true"></a><span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 444.4444</span></span> <span 
id="cb9-11"><a href="#cb9-11" aria-hidden="true"></a><span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 311.5663</span></span> <span id="cb9-12"><a href="#cb9-12" aria-hidden="true"></a><span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 401.6000</span></span> <span id="cb9-13"><a href="#cb9-13" aria-hidden="true"></a><span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 359.4545</span></span> <span id="cb9-14"><a href="#cb9-14" aria-hidden="true"></a><span class="co"># delay</span></span> <span id="cb9-15"><a href="#cb9-15" aria-hidden="true"></a><span class="co"># 1: 27</span></span> <span id="cb9-16"><a href="#cb9-16" aria-hidden="true"></a><span class="co"># 2: 10</span></span> <span id="cb9-17"><a href="#cb9-17" aria-hidden="true"></a><span class="co"># 3: 11</span></span> <span id="cb9-18"><a href="#cb9-18" aria-hidden="true"></a><span class="co"># 4: -34</span></span> <span id="cb9-19"><a href="#cb9-19" aria-hidden="true"></a><span class="co"># 5: 3</span></span> <span id="cb9-20"><a href="#cb9-20" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb9-21"><a href="#cb9-21" aria-hidden="true"></a><span class="co"># 253312: -29</span></span> <span id="cb9-22"><a href="#cb9-22" aria-hidden="true"></a><span class="co"># 253313: -19</span></span> <span id="cb9-23"><a href="#cb9-23" aria-hidden="true"></a><span class="co"># 253314: 8</span></span> <span id="cb9-24"><a href="#cb9-24" aria-hidden="true"></a><span class="co"># 253315: 11</span></span> <span id="cb9-25"><a href="#cb9-25" aria-hidden="true"></a><span class="co"># 253316: -4</span></span></code></pre></div></li> </ul> </div> </div> </div> </div> <div id="section-6" class="section level1"> <h1></h1> <p>Let’s look at all the <code>hours</code> to verify.</p> <div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true"></a><span class="co"># check again for 
'24'</span></span> <span id="cb10-2"><a href="#cb10-2" aria-hidden="true"></a>flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]</span> <span id="cb10-3"><a href="#cb10-3" aria-hidden="true"></a><span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23</span></span></code></pre></div> <div id="update-by-reference-question" class="section level4 bs-callout bs-callout-warning"> <h4 class="bs-callout bs-callout-warning">Exercise:</h4> <p>What is the difference between <code>flights[hour == 24L, hour := 0L]</code> and <code>flights[hour == 24L][, hour := 0L]</code>? Hint: The latter needs an assignment (<code><-</code>) if you want to use the result later.</p> <p>If you can’t figure it out, have a look at the <code>Note</code> section of <code>?":="</code>.</p> </div> <div id="c-delete-column-by-reference" class="section level3"> <h3>c) Delete column by reference</h3> <div id="remove-delay-column" class="section level4"> <h4>– Remove the <code>delay</code> column</h4> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true"></a>flights[, <span class="kw">c</span>(<span class="st">"delay"</span>) <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="ot">NULL</span>]</span> <span id="cb11-2"><a href="#cb11-2" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb11-3"><a href="#cb11-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed</span></span> <span id="cb11-4"><a href="#cb11-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span></span> <span id="cb11-5"><a href="#cb11-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span></span> <span id="cb11-6"><a href="#cb11-6" aria-hidden="true"></a><span 
class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span></span> <span id="cb11-7"><a href="#cb11-7" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span></span> <span id="cb11-8"><a href="#cb11-8" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span></span> <span id="cb11-9"><a href="#cb11-9" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363</span></span> <span id="cb11-10"><a href="#cb11-10" aria-hidden="true"></a></span> <span id="cb11-11"><a href="#cb11-11" aria-hidden="true"></a><span class="co">## or using the functional form</span></span> <span id="cb11-12"><a href="#cb11-12" aria-hidden="true"></a><span class="co"># flights[, `:=`(delay = NULL)]</span></span></code></pre></div> </div> <div id="delete-convenience" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>Assigning <code>NULL</code> to a column <em>deletes</em> that column, and it happens <em>instantly</em>.</p></li> <li><p>We can also pass column numbers instead of names in the <code>LHS</code>, although it is good programming practice to use column names.</p></li> <li><p>When there is just one column to delete, we can drop the <code>c()</code> and double quotes and just use the column name <em>unquoted</em>, for convenience. 
That is:</p> <div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true"></a>flights[, delay <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="ot">NULL</span>]</span></code></pre></div> <p>is equivalent to the code above.</p></li> </ul> </div> </div> <div id="ref-j-by" class="section level3"> <h3>d) <code>:=</code> along with grouping using <code>by</code></h3> <p>We have already seen the use of <code>i</code> along with <code>:=</code> in <a href="#ref-i-j">Section 2b</a>. Let’s now see how we can use <code>:=</code> along with <code>by</code>.</p> <div id="how-can-we-add-a-new-column-which-contains-for-each-origdest-pair-the-maximum-speed" class="section level4"> <h4>– How can we add a new column that contains the maximum speed for each <code>origin, dest</code> pair?</h4> <div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true"></a>flights[, max_speed <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="kw">max</span>(speed), by =<span class="st"> </span>.(origin, dest)]</span> <span id="cb13-2"><a href="#cb13-2" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb13-3"><a href="#cb13-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed max_speed</span></span> <span id="cb13-4"><a href="#cb13-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 526.5957</span></span> <span id="cb13-5"><a href="#cb13-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 526.5957</span></span> <span id="cb13-6"><a href="#cb13-6" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 526.5957</span></span> <span id="cb13-7"><a
href="#cb13-7" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 517.5000</span></span> <span id="cb13-8"><a href="#cb13-8" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 526.5957</span></span> <span id="cb13-9"><a href="#cb13-9" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 518.4507</span></span></code></pre></div> </div> <div id="section-7" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>We add a new column <code>max_speed</code> using the <code>:=</code> operator by reference.</p></li> <li><p>We provide the columns to group by the same way as shown in the <em>Introduction to data.table</em> vignette. For each group, <code>max(speed)</code> is computed, which returns a single value. That value is recycled to fit the length of the group. Once again, no copies are being made at all. <code>flights</code> <em>data.table</em> is modified <em>in-place</em>.</p></li> <li><p>We could have also provided <code>by</code> with a <em>character vector</em> as we saw in the <em>Introduction to data.table</em> vignette, e.g., <code>by = c("origin", "dest")</code>.</p></li> </ul> </div> </div> </div> <div id="section-8" class="section level1"> <h1></h1> <div id="e-multiple-columns-and" class="section level3"> <h3>e) Multiple columns and <code>:=</code></h3> <div id="how-can-we-add-two-more-columns-computing-max-of-dep_delay-and-arr_delay-for-each-month-using-.sd" class="section level4"> <h4>– How can we add two more columns computing <code>max()</code> of <code>dep_delay</code> and <code>arr_delay</code> for each month, using <code>.SD</code>?</h4> <div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true"></a>in_cols =<span class="st"> </span><span class="kw">c</span>(<span class="st">"dep_delay"</span>, <span 
class="st">"arr_delay"</span>)</span> <span id="cb14-2"><a href="#cb14-2" aria-hidden="true"></a>out_cols =<span class="st"> </span><span class="kw">c</span>(<span class="st">"max_dep_delay"</span>, <span class="st">"max_arr_delay"</span>)</span> <span id="cb14-3"><a href="#cb14-3" aria-hidden="true"></a>flights[, <span class="kw">c</span>(out_cols) <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="kw">lapply</span>(.SD, max), by =<span class="st"> </span>month, .SDcols =<span class="st"> </span>in_cols]</span> <span id="cb14-4"><a href="#cb14-4" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb14-5"><a href="#cb14-5" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour speed max_speed</span></span> <span id="cb14-6"><a href="#cb14-6" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490 526.5957</span></span> <span id="cb14-7"><a href="#cb14-7" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909 526.5957</span></span> <span id="cb14-8"><a href="#cb14-8" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769 526.5957</span></span> <span id="cb14-9"><a href="#cb14-9" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414 517.5000</span></span> <span id="cb14-10"><a href="#cb14-10" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857 526.5957</span></span> <span id="cb14-11"><a href="#cb14-11" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363 518.4507</span></span> <span id="cb14-12"><a href="#cb14-12" aria-hidden="true"></a><span class="co"># max_dep_delay max_arr_delay</span></span> <span id="cb14-13"><a href="#cb14-13" aria-hidden="true"></a><span class="co"># 1: 973 996</span></span> <span id="cb14-14"><a 
href="#cb14-14" aria-hidden="true"></a><span class="co"># 2: 973 996</span></span> <span id="cb14-15"><a href="#cb14-15" aria-hidden="true"></a><span class="co"># 3: 973 996</span></span> <span id="cb14-16"><a href="#cb14-16" aria-hidden="true"></a><span class="co"># 4: 973 996</span></span> <span id="cb14-17"><a href="#cb14-17" aria-hidden="true"></a><span class="co"># 5: 973 996</span></span> <span id="cb14-18"><a href="#cb14-18" aria-hidden="true"></a><span class="co"># 6: 973 996</span></span></code></pre></div> </div> <div id="section-9" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>We use the <code>LHS := RHS</code> form. For better readability, we store the input column names and the names of the new columns in separate variables and provide them to <code>.SDcols</code> and the <code>LHS</code>, respectively.</p></li> <li><p>Note that since we allow assignment by reference without quoting column names when there is only one column, as explained in <a href="#delete-convenience">Section 2c</a>, we cannot do <code>out_cols := lapply(.SD, max)</code>. That would result in adding one new column named <code>out_cols</code>. Instead we should do either <code>c(out_cols)</code> or simply <code>(out_cols)</code>. Wrapping the variable name with <code>(</code> is enough to differentiate between the two cases.</p></li> <li><p>The <code>LHS := RHS</code> form allows us to operate on multiple columns. In the RHS, to compute the <code>max</code> on columns specified in <code>.SDcols</code>, we make use of the base function <code>lapply()</code> along with <code>.SD</code> in the same way as we have seen before in the <em>“Introduction to data.table”</em> vignette.
It returns a list of two elements, containing the maximum value corresponding to <code>dep_delay</code> and <code>arr_delay</code> for each group.</p></li> </ul> </div> </div> </div> <div id="section-10" class="section level1"> <h1></h1> <p>Before moving on to the next section, let’s clean up the newly created columns <code>speed</code>, <code>max_speed</code>, <code>max_dep_delay</code> and <code>max_arr_delay</code>.</p> <div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true"></a><span class="co"># RHS gets automatically recycled to length of LHS</span></span> <span id="cb15-2"><a href="#cb15-2" aria-hidden="true"></a>flights[, <span class="kw">c</span>(<span class="st">"speed"</span>, <span class="st">"max_speed"</span>, <span class="st">"max_dep_delay"</span>, <span class="st">"max_arr_delay"</span>) <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="ot">NULL</span>]</span> <span id="cb15-3"><a href="#cb15-3" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb15-4"><a href="#cb15-4" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb15-5"><a href="#cb15-5" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb15-6"><a href="#cb15-6" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb15-7"><a href="#cb15-7" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb15-8"><a href="#cb15-8" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span></span> <span id="cb15-9"><a href="#cb15-9" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb15-10"><a href="#cb15-10" 
aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span></span></code></pre></div> <div id="and-copy" class="section level2"> <h2>3) <code>:=</code> and <code>copy()</code></h2> <p><code>:=</code> modifies the input object by reference. Apart from the features we have discussed already, sometimes we might want to use the update by reference feature for its side effect. At other times it may not be desirable to modify the original object, in which case we can use the <code>copy()</code> function, as we will see in a moment.</p> <div id="a-for-its-side-effect" class="section level3"> <h3>a) <code>:=</code> for its side effect</h3> <p>Let’s say we would like to create a function that returns the <em>maximum speed</em> for each month. But at the same time, we would also like to add the column <code>speed</code> to <em>flights</em>. We could write a simple function as follows:</p> <div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true"></a>foo <-<span class="st"> </span><span class="cf">function</span>(DT) {</span> <span id="cb16-2"><a href="#cb16-2" aria-hidden="true"></a> DT[, speed <span class="op">:</span><span class="er">=</span><span class="st"> </span>distance <span class="op">/</span><span class="st"> </span>(air_time<span class="op">/</span><span class="dv">60</span>)]</span> <span id="cb16-3"><a href="#cb16-3" aria-hidden="true"></a> DT[, .(<span class="dt">max_speed =</span> <span class="kw">max</span>(speed)), by =<span class="st"> </span>month]</span> <span id="cb16-4"><a href="#cb16-4" aria-hidden="true"></a>}</span> <span id="cb16-5"><a href="#cb16-5" aria-hidden="true"></a>ans =<span class="st"> </span><span class="kw">foo</span>(flights)</span> <span id="cb16-6"><a href="#cb16-6" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb16-7"><a href="#cb16-7" aria-hidden="true"></a><span class="co"># year month
day dep_delay arr_delay carrier origin dest air_time distance hour speed</span></span> <span id="cb16-8"><a href="#cb16-8" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 413.6490</span></span> <span id="cb16-9"><a href="#cb16-9" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 409.0909</span></span> <span id="cb16-10"><a href="#cb16-10" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 423.0769</span></span> <span id="cb16-11"><a href="#cb16-11" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 395.5414</span></span> <span id="cb16-12"><a href="#cb16-12" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 424.2857</span></span> <span id="cb16-13"><a href="#cb16-13" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18 434.3363</span></span> <span id="cb16-14"><a href="#cb16-14" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb16-15"><a href="#cb16-15" aria-hidden="true"></a><span class="co"># month max_speed</span></span> <span id="cb16-16"><a href="#cb16-16" aria-hidden="true"></a><span class="co"># 1: 1 535.6425</span></span> <span id="cb16-17"><a href="#cb16-17" aria-hidden="true"></a><span class="co"># 2: 2 535.6425</span></span> <span id="cb16-18"><a href="#cb16-18" aria-hidden="true"></a><span class="co"># 3: 3 549.0756</span></span> <span id="cb16-19"><a href="#cb16-19" aria-hidden="true"></a><span class="co"># 4: 4 585.6000</span></span> <span id="cb16-20"><a href="#cb16-20" aria-hidden="true"></a><span class="co"># 5: 5 544.2857</span></span> <span id="cb16-21"><a href="#cb16-21" aria-hidden="true"></a><span class="co"># 6: 6 608.5714</span></span></code></pre></div> <div id="section-11" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>Note that the new column <code>speed</code> has 
been added to <code>flights</code> <em>data.table</em>. This is because <code>:=</code> performs operations by reference. Since <code>DT</code> (the function argument) and <code>flights</code> refer to the same object in memory, modifying <code>DT</code> also modifies <code>flights</code>.</p></li> <li><p>And <code>ans</code> contains the maximum speed for each month.</p></li> </ul> </div> </div> <div id="b-the-copy-function" class="section level3"> <h3>b) The <code>copy()</code> function</h3> <p>In the previous section, we used <code>:=</code> for its side effect. But of course this may not always be desirable. Sometimes, we would like to pass a <em>data.table</em> object to a function, and might want to use the <code>:=</code> operator, but <em>wouldn’t</em> want to update the original object. We can accomplish this using the function <code>copy()</code>.</p> <div id="section-12" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <p>The <code>copy()</code> function <em>deep</em> copies the input object, and therefore any subsequent update by reference operations performed on the copied object will not affect the original object.</p> </div> </div> </div> </div> <div id="section-13" class="section level1"> <h1></h1> <p>There are two particular places where the <code>copy()</code> function is essential:</p> <ol style="list-style-type: decimal"> <li><p>Contrary to the situation we have seen in the previous point, we may not want the input data.table to a function to be modified <em>by reference</em>.
As an example, let’s consider the task in the previous section, except we don’t want to modify <code>flights</code> by reference.</p> <p>Let’s first delete the <code>speed</code> column we generated in the previous section.</p> <div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true"></a>flights[, speed <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="ot">NULL</span>]</span></code></pre></div> <p>Now, we could accomplish the task as follows:</p> <div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true"></a>foo <-<span class="st"> </span><span class="cf">function</span>(DT) {</span> <span id="cb18-2"><a href="#cb18-2" aria-hidden="true"></a> DT <-<span class="st"> </span><span class="kw">copy</span>(DT) <span class="co">## deep copy</span></span> <span id="cb18-3"><a href="#cb18-3" aria-hidden="true"></a> DT[, speed <span class="op">:</span><span class="er">=</span><span class="st"> </span>distance <span class="op">/</span><span class="st"> </span>(air_time<span class="op">/</span><span class="dv">60</span>)] <span class="co">## doesn't affect 'flights'</span></span> <span id="cb18-4"><a href="#cb18-4" aria-hidden="true"></a> DT[, .(<span class="dt">max_speed =</span> <span class="kw">max</span>(speed)), by =<span class="st"> </span>month]</span> <span id="cb18-5"><a href="#cb18-5" aria-hidden="true"></a>}</span> <span id="cb18-6"><a href="#cb18-6" aria-hidden="true"></a>ans <-<span class="st"> </span><span class="kw">foo</span>(flights)</span> <span id="cb18-7"><a href="#cb18-7" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb18-8"><a href="#cb18-8" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb18-9"><a href="#cb18-9" 
aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb18-10"><a href="#cb18-10" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb18-11"><a href="#cb18-11" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb18-12"><a href="#cb18-12" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span></span> <span id="cb18-13"><a href="#cb18-13" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb18-14"><a href="#cb18-14" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span></span> <span id="cb18-15"><a href="#cb18-15" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb18-16"><a href="#cb18-16" aria-hidden="true"></a><span class="co"># month max_speed</span></span> <span id="cb18-17"><a href="#cb18-17" aria-hidden="true"></a><span class="co"># 1: 1 535.6425</span></span> <span id="cb18-18"><a href="#cb18-18" aria-hidden="true"></a><span class="co"># 2: 2 535.6425</span></span> <span id="cb18-19"><a href="#cb18-19" aria-hidden="true"></a><span class="co"># 3: 3 549.0756</span></span> <span id="cb18-20"><a href="#cb18-20" aria-hidden="true"></a><span class="co"># 4: 4 585.6000</span></span> <span id="cb18-21"><a href="#cb18-21" aria-hidden="true"></a><span class="co"># 5: 5 544.2857</span></span> <span id="cb18-22"><a href="#cb18-22" aria-hidden="true"></a><span class="co"># 6: 6 608.5714</span></span></code></pre></div></li> </ol> <div id="section-14" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>Because we used the <code>copy()</code> function, the <code>flights</code> <em>data.table</em> was not updated by reference.
It doesn’t contain the column <code>speed</code>.</p></li> <li><p>And <code>ans</code> contains the maximum speed corresponding to each month.</p></li> </ul> <p>However, we could improve this functionality further by <em>shallow</em> copying instead of <em>deep</em> copying. In fact, we would very much like to <a href="https://github.com/Rdatatable/data.table/issues/617">provide this functionality for <code>v1.9.8</code></a>. We will touch upon this again in the <em>data.table design</em> vignette.</p> </div> </div> <div id="section-15" class="section level1"> <h1></h1> <ol start="2" style="list-style-type: decimal"> <li><p>When we store the column names in a variable, e.g., <code>DT_n = names(DT)</code>, and then <em>add/update/delete</em> column(s) <em>by reference</em>, <code>DT_n</code> is modified as well, unless we do <code>copy(names(DT))</code>.</p> <div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true"></a>DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">x =</span> 1L, <span class="dt">y =</span> 2L)</span> <span id="cb19-2"><a href="#cb19-2" aria-hidden="true"></a>DT_n =<span class="st"> </span><span class="kw">names</span>(DT)</span> <span id="cb19-3"><a href="#cb19-3" aria-hidden="true"></a>DT_n</span> <span id="cb19-4"><a href="#cb19-4" aria-hidden="true"></a><span class="co"># [1] "x" "y"</span></span> <span id="cb19-5"><a href="#cb19-5" aria-hidden="true"></a></span> <span id="cb19-6"><a href="#cb19-6" aria-hidden="true"></a><span class="co">## add a new column by reference</span></span> <span id="cb19-7"><a href="#cb19-7" aria-hidden="true"></a>DT[, z <span class="op">:</span><span class="er">=</span><span class="st"> </span>3L]</span> <span id="cb19-8"><a href="#cb19-8" aria-hidden="true"></a></span> <span id="cb19-9"><a href="#cb19-9" aria-hidden="true"></a><span class="co">## DT_n also gets updated</span></span> <span
id="cb19-10"><a href="#cb19-10" aria-hidden="true"></a>DT_n</span> <span id="cb19-11"><a href="#cb19-11" aria-hidden="true"></a><span class="co"># [1] "x" "y" "z"</span></span> <span id="cb19-12"><a href="#cb19-12" aria-hidden="true"></a></span> <span id="cb19-13"><a href="#cb19-13" aria-hidden="true"></a><span class="co">## use `copy()`</span></span> <span id="cb19-14"><a href="#cb19-14" aria-hidden="true"></a>DT_n =<span class="st"> </span><span class="kw">copy</span>(<span class="kw">names</span>(DT))</span> <span id="cb19-15"><a href="#cb19-15" aria-hidden="true"></a>DT[, w <span class="op">:</span><span class="er">=</span><span class="st"> </span>4L]</span> <span id="cb19-16"><a href="#cb19-16" aria-hidden="true"></a></span> <span id="cb19-17"><a href="#cb19-17" aria-hidden="true"></a><span class="co">## DT_n doesn't get updated</span></span> <span id="cb19-18"><a href="#cb19-18" aria-hidden="true"></a>DT_n</span> <span id="cb19-19"><a href="#cb19-19" aria-hidden="true"></a><span class="co"># [1] "x" "y" "z"</span></span></code></pre></div></li> </ol> <div id="summary" class="section level2"> <h2>Summary</h2> <div id="the-operator" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">The <code>:=</code> operator</h4> <ul> <li><p>It is used to <em>add/update/delete</em> columns by reference.</p></li> <li><p>We have also seen how to use <code>:=</code> along with <code>i</code> and <code>by</code>, in the same way as in the <em>Introduction to data.table</em> vignette. We can likewise use <code>keyby</code>, chain operations together, and pass expressions to <code>by</code>.
The syntax is <em>consistent</em>.</p></li> <li><p>We can use <code>:=</code> for its side effect or use <code>copy()</code> to not modify the original object while updating by reference.</p></li> </ul> </div> </div> </div> <div id="section-16" class="section level1"> <h1></h1> <p>So far we have seen a whole lot in <code>j</code>, how to combine it with <code>by</code>, and a little of <code>i</code>. Let’s turn our attention back to <code>i</code> in the next vignette, <em>“Keys and fast binary search based subset”</em>, to perform <em>blazing fast subsets</em> by <em>keying data.tables</em>.</p> <hr /> </div> <!-- code folding --> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>