EVOLUTION-MANAGER
Edit File: datatable-intro.html
<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <meta name="date" content="2022-10-09" /> <title>Introduction to data.table</title> <script>// Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); </script> <script>// Hide empty <a> tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); </script> <style type="text/css"> code{white-space: pre-wrap;} span.smallcaps{font-variant: small-caps;} span.underline{text-decoration: underline;} div.column{display: inline-block; vertical-align: top; width: 50%;} div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} ul.task-list{list-style: none;} </style> <style type="text/css"> code { white-space: pre; } .sourceCode { overflow: visible; } </style> <style type="text/css" data-origin="pandoc"> pre > code.sourceCode { white-space: pre; position: relative; } pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } code.sourceCode > span { color: inherit; text-decoration: inherit; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { pre > code.sourceCode { white-space: pre-wrap; } pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } var j = 0; while (j < rules.length) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") { j++; continue; } var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' && rule.style.backgroundColor === '') { j++; continue; } // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> <style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">Introduction to data.table</h1> <h4 class="date">2022-10-09</h4> <p>This vignette introduces the <code>data.table</code> syntax, its general form, how to <em>subset</em> rows, <em>select and compute</em> on columns, and perform aggregations <em>by group</em>. Familiarity with <code>data.frame</code> data structure from base R is useful, but not essential to follow this vignette.</p> <hr /> <div id="data-analysis-using-data.table" class="section level2"> <h2>Data analysis using <code>data.table</code></h2> <p>Data manipulation operations such as <em>subset</em>, <em>group</em>, <em>update</em>, <em>join</em> etc., are all inherently related. Keeping these <em>related operations together</em> allows for:</p> <ul> <li><p><em>concise</em> and <em>consistent</em> syntax irrespective of the set of operations you would like to perform to achieve your end goal.</p></li> <li><p>performing analysis <em>fluidly</em> without the cognitive burden of having to map each operation to a particular function from a potentially huge set of functions available before performing the analysis.</p></li> <li><p><em>automatically</em> optimising operations internally, and very effectively, by knowing precisely the data required for each operation, leading to very fast and memory efficient code.</p></li> </ul> <p>Briefly, if you are interested in reducing <em>programming</em> and <em>compute</em> time tremendously, then this package is for you. The philosophy that <code>data.table</code> adheres to makes this possible. Our goal is to illustrate it through this series of vignettes.</p> </div> <div id="data" class="section level2"> <h2>Data</h2> <p>In this vignette, we will use <a href="https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv">NYC-flights14</a> data obtained by <a href="https://github.com/arunsrinivasan/flights">flights</a> package (available on GitHub only). It contains On-Time flights data from the Bureau of Transporation Statistics for all the flights that departed from New York City airports in 2014 (inspired by <a href="https://github.com/tidyverse/nycflights13">nycflights13</a>). The data is available only for Jan-Oct’14.</p> <p>We can use <code>data.table</code>’s fast-and-friendly file reader <code>fread</code> to load <code>flights</code> directly as follows:</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true"></a>input <-<span class="st"> </span><span class="cf">if</span> (<span class="kw">file.exists</span>(<span class="st">"flights14.csv"</span>)) {</span> <span id="cb1-2"><a href="#cb1-2" aria-hidden="true"></a> <span class="st">"flights14.csv"</span></span> <span id="cb1-3"><a href="#cb1-3" aria-hidden="true"></a>} <span class="cf">else</span> {</span> <span id="cb1-4"><a href="#cb1-4" aria-hidden="true"></a> <span class="st">"https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"</span></span> <span id="cb1-5"><a href="#cb1-5" aria-hidden="true"></a>}</span> <span id="cb1-6"><a href="#cb1-6" aria-hidden="true"></a>flights <-<span class="st"> </span><span class="kw">fread</span>(input)</span> <span id="cb1-7"><a href="#cb1-7" aria-hidden="true"></a>flights</span> <span id="cb1-8"><a href="#cb1-8" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb1-9"><a href="#cb1-9" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb1-10"><a href="#cb1-10" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb1-11"><a href="#cb1-11" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb1-12"><a href="#cb1-12" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span></span> <span id="cb1-13"><a href="#cb1-13" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb1-14"><a href="#cb1-14" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb1-15"><a href="#cb1-15" aria-hidden="true"></a><span class="co"># 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14</span></span> <span id="cb1-16"><a href="#cb1-16" aria-hidden="true"></a><span class="co"># 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8</span></span> <span id="cb1-17"><a href="#cb1-17" aria-hidden="true"></a><span class="co"># 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11</span></span> <span id="cb1-18"><a href="#cb1-18" aria-hidden="true"></a><span class="co"># 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11</span></span> <span id="cb1-19"><a href="#cb1-19" aria-hidden="true"></a><span class="co"># 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8</span></span> <span id="cb1-20"><a href="#cb1-20" aria-hidden="true"></a><span class="kw">dim</span>(flights)</span> <span id="cb1-21"><a href="#cb1-21" aria-hidden="true"></a><span class="co"># [1] 253316 11</span></span></code></pre></div> <p>Aside: <code>fread</code> accepts <code>http</code> and <code>https</code> URLs directly as well as operating system commands such as <code>sed</code> and <code>awk</code> output. See <code>?fread</code> for examples.</p> </div> <div id="introduction" class="section level2"> <h2>Introduction</h2> <p>In this vignette, we will</p> <ol style="list-style-type: decimal"> <li><p>Start with basics - what is a <code>data.table</code>, its general form, how to <em>subset</em> rows, how to <em>select and compute</em> on columns;</p></li> <li><p>Then we will look at performing data aggregations by group</p></li> </ol> </div> <div id="basics-1" class="section level2"> <h2>1. Basics</h2> <div id="what-is-datatable-1a" class="section level3"> <h3>a) What is <code>data.table</code>?</h3> <p><code>data.table</code> is an R package that provides <strong>an enhanced version</strong> of <code>data.frame</code>s, which are the standard data structure for storing data in <code>base</code> R. In the <a href="#data">Data</a> section above, we already created a <code>data.table</code> using <code>fread()</code>. We can also create one using the <code>data.table()</code> function. Here is an example:</p> <div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true"></a>DT =<span class="st"> </span><span class="kw">data.table</span>(</span> <span id="cb2-2"><a href="#cb2-2" aria-hidden="true"></a> <span class="dt">ID =</span> <span class="kw">c</span>(<span class="st">"b"</span>,<span class="st">"b"</span>,<span class="st">"b"</span>,<span class="st">"a"</span>,<span class="st">"a"</span>,<span class="st">"c"</span>),</span> <span id="cb2-3"><a href="#cb2-3" aria-hidden="true"></a> <span class="dt">a =</span> <span class="dv">1</span><span class="op">:</span><span class="dv">6</span>,</span> <span id="cb2-4"><a href="#cb2-4" aria-hidden="true"></a> <span class="dt">b =</span> <span class="dv">7</span><span class="op">:</span><span class="dv">12</span>,</span> <span id="cb2-5"><a href="#cb2-5" aria-hidden="true"></a> <span class="dt">c =</span> <span class="dv">13</span><span class="op">:</span><span class="dv">18</span></span> <span id="cb2-6"><a href="#cb2-6" aria-hidden="true"></a>)</span> <span id="cb2-7"><a href="#cb2-7" aria-hidden="true"></a>DT</span> <span id="cb2-8"><a href="#cb2-8" aria-hidden="true"></a><span class="co"># ID a b c</span></span> <span id="cb2-9"><a href="#cb2-9" aria-hidden="true"></a><span class="co"># 1: b 1 7 13</span></span> <span id="cb2-10"><a href="#cb2-10" aria-hidden="true"></a><span class="co"># 2: b 2 8 14</span></span> <span id="cb2-11"><a href="#cb2-11" aria-hidden="true"></a><span class="co"># 3: b 3 9 15</span></span> <span id="cb2-12"><a href="#cb2-12" aria-hidden="true"></a><span class="co"># 4: a 4 10 16</span></span> <span id="cb2-13"><a href="#cb2-13" aria-hidden="true"></a><span class="co"># 5: a 5 11 17</span></span> <span id="cb2-14"><a href="#cb2-14" aria-hidden="true"></a><span class="co"># 6: c 6 12 18</span></span> <span id="cb2-15"><a href="#cb2-15" aria-hidden="true"></a><span class="kw">class</span>(DT<span class="op">$</span>ID)</span> <span id="cb2-16"><a href="#cb2-16" aria-hidden="true"></a><span class="co"># [1] "character"</span></span></code></pre></div> <p>You can also convert existing objects to a <code>data.table</code> using <code>setDT()</code> (for <code>data.frame</code>s and <code>list</code>s) and <code>as.data.table()</code> (for other structures); the difference is beyond the scope of this vignette, see <code>?setDT</code> and <code>?as.data.table</code> for more details.</p> <div id="note-that" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">Note that:</h4> <ul> <li><p>Unlike <code>data.frame</code>s, columns of <code>character</code> type are <em>never</em> converted to <code>factors</code> by default.</p></li> <li><p>Row numbers are printed with a <code>:</code> in order to visually separate the row number from the first column.</p></li> <li><p>When the number of rows to print exceeds the global option <code>datatable.print.nrows</code> (default = 100), it automatically prints only the top 5 and bottom 5 rows (as can be seen in the <a href="#data">Data</a> section). If you’ve had a lot of experience with <code>data.frame</code>s, you may have found yourself waiting around while larger tables print-and-page, sometimes seemingly endlessly. You can query the default number like so:</p> <div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true"></a><span class="kw">getOption</span>(<span class="st">"datatable.print.nrows"</span>)</span></code></pre></div></li> <li><p><code>data.table</code> doesn’t set or use <em>row names</em>, ever. We will see why in the <em>“Keys and fast binary search based subset”</em> vignette.</p></li> </ul> </div> </div> <div id="enhanced-1b" class="section level3"> <h3>b) General form - in what way is a <code>data.table</code> <em>enhanced</em>?</h3> <p>In contrast to a <code>data.frame</code>, you can do <em>a lot more</em> than just subsetting rows and selecting columns within the frame of a <code>data.table</code>, i.e., within <code>[ ... ]</code> (NB: we might also refer to writing things inside <code>DT[...]</code> as “querying <code>DT</code>”, in analogy to SQL). To understand it we will have to first look at the <em>general form</em> of <code>data.table</code> syntax, as shown below:</p> <div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true"></a>DT[i, j, by]</span> <span id="cb4-2"><a href="#cb4-2" aria-hidden="true"></a></span> <span id="cb4-3"><a href="#cb4-3" aria-hidden="true"></a><span class="co">## R: i j by</span></span> <span id="cb4-4"><a href="#cb4-4" aria-hidden="true"></a><span class="co">## SQL: where | order by select | update group by</span></span></code></pre></div> <p>Users who have an SQL background might perhaps immediately relate to this syntax.</p> <div id="the-way-to-read-it-out-loud-is" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">The way to read it (out loud) is:</h4> <p>Take <code>DT</code>, subset/reorder rows using <code>i</code>, then calculate <code>j</code>, grouped by <code>by</code>.</p> <p>Let’s begin by looking at <code>i</code> and <code>j</code> first - subsetting rows and operating on columns.</p> </div> </div> <div id="subset-i-1c" class="section level3"> <h3>c) Subset rows in <code>i</code></h3> <div id="get-all-the-flights-with-jfk-as-the-origin-airport-in-the-month-of-june." class="section level4"> <h4>– Get all the flights with “JFK” as the origin airport in the month of June.</h4> <div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[origin <span class="op">==</span><span class="st"> "JFK"</span> <span class="op">&</span><span class="st"> </span>month <span class="op">==</span><span class="st"> </span>6L]</span> <span id="cb5-2"><a href="#cb5-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb5-3"><a href="#cb5-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb5-4"><a href="#cb5-4" aria-hidden="true"></a><span class="co"># 1: 2014 6 1 -9 -5 AA JFK LAX 324 2475 8</span></span> <span id="cb5-5"><a href="#cb5-5" aria-hidden="true"></a><span class="co"># 2: 2014 6 1 -10 -13 AA JFK LAX 329 2475 12</span></span> <span id="cb5-6"><a href="#cb5-6" aria-hidden="true"></a><span class="co"># 3: 2014 6 1 18 -1 AA JFK LAX 326 2475 7</span></span> <span id="cb5-7"><a href="#cb5-7" aria-hidden="true"></a><span class="co"># 4: 2014 6 1 -6 -16 AA JFK LAX 320 2475 10</span></span> <span id="cb5-8"><a href="#cb5-8" aria-hidden="true"></a><span class="co"># 5: 2014 6 1 -4 -45 AA JFK LAX 326 2475 18</span></span> <span id="cb5-9"><a href="#cb5-9" aria-hidden="true"></a><span class="co"># 6: 2014 6 1 -6 -23 AA JFK LAX 329 2475 14</span></span></code></pre></div> </div> <div id="section" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>Within the frame of a <code>data.table</code>, columns can be referred to <em>as if they are variables</em>, much like in SQL or Stata. Therefore, we simply refer to <code>origin</code> and <code>month</code> as if they are variables. We do not need to add the prefix <code>flights$</code> each time. Nevertheless, using <code>flights$origin</code> and <code>flights$month</code> would work just fine.</p></li> <li><p>The <em>row indices</em> that satisfy the condition <code>origin == "JFK" & month == 6L</code> are computed, and since there is nothing else left to do, all columns from <code>flights</code> at rows corresponding to those <em>row indices</em> are simply returned as a <code>data.table</code>.</p></li> <li><p>A comma after the condition in <code>i</code> is not required. But <code>flights[origin == "JFK" & month == 6L, ]</code> would work just fine. In <code>data.frame</code>s, however, the comma is necessary.</p></li> </ul> </div> <div id="subset-rows-integer" class="section level4"> <h4>– Get the first two rows from <code>flights</code>.</h4> <div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[<span class="dv">1</span><span class="op">:</span><span class="dv">2</span>]</span> <span id="cb6-2"><a href="#cb6-2" aria-hidden="true"></a>ans</span> <span id="cb6-3"><a href="#cb6-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb6-4"><a href="#cb6-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb6-5"><a href="#cb6-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span></code></pre></div> </div> <div id="section-1" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li>In this case, there is no condition. The row indices are already provided in <code>i</code>. We therefore return a <code>data.table</code> with all columns from <code>flights</code> at rows for those <em>row indices</em>.</li> </ul> </div> <div id="sort-flights-first-by-column-origin-in-ascending-order-and-then-by-dest-in-descending-order" class="section level4"> <h4>– Sort <code>flights</code> first by column <code>origin</code> in <em>ascending</em> order, and then by <code>dest</code> in <em>descending</em> order:</h4> <p>We can use the R function <code>order()</code> to accomplish this.</p> <div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[<span class="kw">order</span>(origin, <span class="op">-</span>dest)]</span> <span id="cb7-2"><a href="#cb7-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb7-3"><a href="#cb7-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb7-4"><a href="#cb7-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 5 6 49 EV EWR XNA 195 1131 8</span></span> <span id="cb7-5"><a href="#cb7-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 6 7 13 EV EWR XNA 190 1131 8</span></span> <span id="cb7-6"><a href="#cb7-6" aria-hidden="true"></a><span class="co"># 3: 2014 1 7 -6 -13 EV EWR XNA 179 1131 8</span></span> <span id="cb7-7"><a href="#cb7-7" aria-hidden="true"></a><span class="co"># 4: 2014 1 8 -7 -12 EV EWR XNA 184 1131 8</span></span> <span id="cb7-8"><a href="#cb7-8" aria-hidden="true"></a><span class="co"># 5: 2014 1 9 16 7 EV EWR XNA 181 1131 8</span></span> <span id="cb7-9"><a href="#cb7-9" aria-hidden="true"></a><span class="co"># 6: 2014 1 13 66 66 EV EWR XNA 188 1131 9</span></span></code></pre></div> </div> <div id="order-is-internally-optimised" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"><code>order()</code> is internally optimised</h4> <ul> <li><p>We can use “-” on a <code>character</code> columns within the frame of a <code>data.table</code> to sort in decreasing order.</p></li> <li><p>In addition, <code>order(...)</code> within the frame of a <code>data.table</code> uses <code>data.table</code>’s internal fast radix order <code>forder()</code>. This sort provided such a compelling improvement over R’s <code>base::order</code> that the R project adopted the <code>data.table</code> algorithm as its default sort in 2016 for R 3.3.0, see <code>?sort</code> and the <a href="https://cran.r-project.org/doc/manuals/r-release/NEWS.pdf">R Release NEWS</a>.</p></li> </ul> <p>We will discuss <code>data.table</code>’s fast order in more detail in the <em><code>data.table</code> internals</em> vignette.</p> </div> </div> <div id="select-j-1d" class="section level3"> <h3>d) Select column(s) in <code>j</code></h3> <div id="select-arr_delay-column-but-return-it-as-a-vector." class="section level4"> <h4>– Select <code>arr_delay</code> column, but return it as a <em>vector</em>.</h4> <div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, arr_delay]</span> <span id="cb8-2"><a href="#cb8-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb8-3"><a href="#cb8-3" aria-hidden="true"></a><span class="co"># [1] 13 13 9 -26 1 0</span></span></code></pre></div> </div> <div id="section-2" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>Since columns can be referred to as if they are variables within the frame of <code>data.table</code>s, we directly refer to the <em>variable</em> we want to subset. Since we want <em>all the rows</em>, we simply skip <code>i</code>.</p></li> <li><p>It returns <em>all</em> the rows for the column <code>arr_delay</code>.</p></li> </ul> </div> <div id="select-arr_delay-column-but-return-as-a-data.table-instead." class="section level4"> <h4>– Select <code>arr_delay</code> column, but return as a <code>data.table</code> instead.</h4> <div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, <span class="kw">list</span>(arr_delay)]</span> <span id="cb9-2"><a href="#cb9-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb9-3"><a href="#cb9-3" aria-hidden="true"></a><span class="co"># arr_delay</span></span> <span id="cb9-4"><a href="#cb9-4" aria-hidden="true"></a><span class="co"># 1: 13</span></span> <span id="cb9-5"><a href="#cb9-5" aria-hidden="true"></a><span class="co"># 2: 13</span></span> <span id="cb9-6"><a href="#cb9-6" aria-hidden="true"></a><span class="co"># 3: 9</span></span> <span id="cb9-7"><a href="#cb9-7" aria-hidden="true"></a><span class="co"># 4: -26</span></span> <span id="cb9-8"><a href="#cb9-8" aria-hidden="true"></a><span class="co"># 5: 1</span></span> <span id="cb9-9"><a href="#cb9-9" aria-hidden="true"></a><span class="co"># 6: 0</span></span></code></pre></div> </div> <div id="section-3" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>We wrap the <em>variables</em> (column names) within <code>list()</code>, which ensures that a <code>data.table</code> is returned. In case of a single column name, not wrapping with <code>list()</code> returns a vector instead, as seen in the <a href="#select-j-1d">previous example</a>.</p></li> <li><p><code>data.table</code> also allows wrapping columns with <code>.()</code> instead of <code>list()</code>. It is an <em>alias</em> to <code>list()</code>; they both mean the same. Feel free to use whichever you prefer; we have noticed most users seem to prefer <code>.()</code> for conciseness, so we will continue to use <code>.()</code> hereafter.</p></li> </ul> <p><code>data.table</code>s (and <code>data.frame</code>s) are internally <code>list</code>s as well, with the stipulation that each element has the same length and the <code>list</code> has a <code>class</code> attribute. Allowing <code>j</code> to return a <code>list</code> enables converting and returning <code>data.table</code> very efficiently.</p> </div> <div id="tip-1" class="section level4 bs-callout bs-callout-warning"> <h4 class="bs-callout bs-callout-warning">Tip:</h4> <p>As long as <code>j-expression</code> returns a <code>list</code>, each element of the list will be converted to a column in the resulting <code>data.table</code>. This makes <code>j</code> quite powerful, as we will see shortly. It is also very important to understand this for when you’d like to make more complicated queries!!</p> </div> <div id="select-both-arr_delay-and-dep_delay-columns." class="section level4"> <h4>– Select both <code>arr_delay</code> and <code>dep_delay</code> columns.</h4> <div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, .(arr_delay, dep_delay)]</span> <span id="cb10-2"><a href="#cb10-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb10-3"><a href="#cb10-3" aria-hidden="true"></a><span class="co"># arr_delay dep_delay</span></span> <span id="cb10-4"><a href="#cb10-4" aria-hidden="true"></a><span class="co"># 1: 13 14</span></span> <span id="cb10-5"><a href="#cb10-5" aria-hidden="true"></a><span class="co"># 2: 13 -3</span></span> <span id="cb10-6"><a href="#cb10-6" aria-hidden="true"></a><span class="co"># 3: 9 2</span></span> <span id="cb10-7"><a href="#cb10-7" aria-hidden="true"></a><span class="co"># 4: -26 -8</span></span> <span id="cb10-8"><a href="#cb10-8" aria-hidden="true"></a><span class="co"># 5: 1 2</span></span> <span id="cb10-9"><a href="#cb10-9" aria-hidden="true"></a><span class="co"># 6: 0 4</span></span> <span id="cb10-10"><a href="#cb10-10" aria-hidden="true"></a></span> <span id="cb10-11"><a href="#cb10-11" aria-hidden="true"></a><span class="co">## alternatively</span></span> <span id="cb10-12"><a href="#cb10-12" aria-hidden="true"></a><span class="co"># ans <- flights[, list(arr_delay, dep_delay)]</span></span></code></pre></div> </div> <div id="section-4" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li>Wrap both columns within <code>.()</code>, or <code>list()</code>. That’s it.</li> </ul> </div> <div id="select-both-arr_delay-and-dep_delay-columns-and-rename-them-to-delay_arr-and-delay_dep." class="section level4"> <h4>– Select both <code>arr_delay</code> and <code>dep_delay</code> columns <em>and</em> rename them to <code>delay_arr</code> and <code>delay_dep</code>.</h4> <p>Since <code>.()</code> is just an alias for <code>list()</code>, we can name columns as we would while creating a <code>list</code>.</p> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, .(<span class="dt">delay_arr =</span> arr_delay, <span class="dt">delay_dep =</span> dep_delay)]</span> <span id="cb11-2"><a href="#cb11-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb11-3"><a href="#cb11-3" aria-hidden="true"></a><span class="co"># delay_arr delay_dep</span></span> <span id="cb11-4"><a href="#cb11-4" aria-hidden="true"></a><span class="co"># 1: 13 14</span></span> <span id="cb11-5"><a href="#cb11-5" aria-hidden="true"></a><span class="co"># 2: 13 -3</span></span> <span id="cb11-6"><a href="#cb11-6" aria-hidden="true"></a><span class="co"># 3: 9 2</span></span> <span id="cb11-7"><a href="#cb11-7" aria-hidden="true"></a><span class="co"># 4: -26 -8</span></span> <span id="cb11-8"><a href="#cb11-8" aria-hidden="true"></a><span class="co"># 5: 1 2</span></span> <span id="cb11-9"><a href="#cb11-9" aria-hidden="true"></a><span class="co"># 6: 0 4</span></span></code></pre></div> <p>That’s it.</p> </div> </div> <div id="e-compute-or-do-in-j" class="section level3"> <h3>e) Compute or <em>do</em> in <code>j</code></h3> <div id="how-many-trips-have-had-total-delay-0" class="section level4"> <h4>– How many trips have had total delay < 0?</h4> <div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, <span class="kw">sum</span>( (arr_delay <span class="op">+</span><span class="st"> </span>dep_delay) <span class="op"><</span><span class="st"> </span><span class="dv">0</span> )]</span> <span id="cb12-2"><a href="#cb12-2" aria-hidden="true"></a>ans</span> <span id="cb12-3"><a href="#cb12-3" aria-hidden="true"></a><span class="co"># [1] 141814</span></span></code></pre></div> </div> <div id="whats-happening-here" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">What’s happening here?</h4> <ul> <li><code>data.table</code>’s <code>j</code> can handle more than just <em>selecting columns</em> - it can handle <em>expressions</em>, i.e., <em>computing on columns</em>. This shouldn’t be surprising, as columns can be referred to as if they are variables. Then we should be able to <em>compute</em> by calling functions on those variables. And that’s what precisely happens here.</li> </ul> </div> </div> <div id="f-subset-in-i-and-do-in-j" class="section level3"> <h3>f) Subset in <code>i</code> <em>and</em> do in <code>j</code></h3> <div id="calculate-the-average-arrival-and-departure-delay-for-all-flights-with-jfk-as-the-origin-airport-in-the-month-of-june." class="section level4"> <h4>– Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June.</h4> <div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[origin <span class="op">==</span><span class="st"> "JFK"</span> <span class="op">&</span><span class="st"> </span>month <span class="op">==</span><span class="st"> </span>6L,</span> <span id="cb13-2"><a href="#cb13-2" aria-hidden="true"></a> .(<span class="dt">m_arr =</span> <span class="kw">mean</span>(arr_delay), <span class="dt">m_dep =</span> <span class="kw">mean</span>(dep_delay))]</span> <span id="cb13-3"><a href="#cb13-3" aria-hidden="true"></a>ans</span> <span id="cb13-4"><a href="#cb13-4" aria-hidden="true"></a><span class="co"># m_arr m_dep</span></span> <span id="cb13-5"><a href="#cb13-5" aria-hidden="true"></a><span class="co"># 1: 5.839349 9.807884</span></span></code></pre></div> </div> <div id="section-5" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>We first subset in <code>i</code> to find matching <em>row indices</em> where <code>origin</code> airport equals <code>"JFK"</code>, and <code>month</code> equals <code>6L</code>. We <em>do not</em> subset the <em>entire</em> <code>data.table</code> corresponding to those rows <em>yet</em>.</p></li> <li><p>Now, we look at <code>j</code> and find that it uses only <em>two columns</em>. And what we have to do is to compute their <code>mean()</code>. Therefore we subset just those columns corresponding to the matching rows, and compute their <code>mean()</code>.</p></li> </ul> <p>Because the three main components of the query (<code>i</code>, <code>j</code> and <code>by</code>) are <em>together</em> inside <code>[...]</code>, <code>data.table</code> can see all three and optimise the query altogether <em>before evaluation</em>, not each separately. We are able to therefore avoid the entire subset (i.e., subsetting the columns <em>besides</em> <code>arr_delay</code> and <code>dep_delay</code>), for both speed and memory efficiency.</p> </div> <div id="how-many-trips-have-been-made-in-2014-from-jfk-airport-in-the-month-of-june" class="section level4"> <h4>– How many trips have been made in 2014 from “JFK” airport in the month of June?</h4> <div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[origin <span class="op">==</span><span class="st"> "JFK"</span> <span class="op">&</span><span class="st"> </span>month <span class="op">==</span><span class="st"> </span>6L, <span class="kw">length</span>(dest)]</span> <span id="cb14-2"><a href="#cb14-2" aria-hidden="true"></a>ans</span> <span id="cb14-3"><a href="#cb14-3" aria-hidden="true"></a><span class="co"># [1] 8422</span></span></code></pre></div> <p>The function <code>length()</code> requires an input argument. We just needed to compute the number of rows in the subset. We could have used any other column as input argument to <code>length()</code> really. This approach is reminiscent of <code>SELECT COUNT(dest) FROM flights WHERE origin = 'JFK' AND month = 6</code> in SQL.</p> <p>This type of operation occurs quite frequently, especially while grouping (as we will see in the next section), to the point where <code>data.table</code> provides a <em>special symbol</em> <code>.N</code> for it.</p> </div> <div id="special-N" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">Special symbol <code>.N</code>:</h4> <p><code>.N</code> is a special built-in variable that holds the number of observations <em>in the current group</em>. It is particularly useful when combined with <code>by</code> as we’ll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.</p> <p>So we can now accomplish the same task by using <code>.N</code> as follows:</p> <div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[origin <span class="op">==</span><span class="st"> "JFK"</span> <span class="op">&</span><span class="st"> </span>month <span class="op">==</span><span class="st"> </span>6L, .N]</span> <span id="cb15-2"><a href="#cb15-2" aria-hidden="true"></a>ans</span> <span id="cb15-3"><a href="#cb15-3" aria-hidden="true"></a><span class="co"># [1] 8422</span></span></code></pre></div> </div> <div id="section-6" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>Once again, we subset in <code>i</code> to get the <em>row indices</em> where <code>origin</code> airport equals <em>“JFK”</em>, and <code>month</code> equals <em>6</em>.</p></li> <li><p>We see that <code>j</code> uses only <code>.N</code> and no other columns. Therefore the entire subset is not materialised. We simply return the number of rows in the subset (which is just the length of row indices).</p></li> <li><p>Note that we did not wrap <code>.N</code> with <code>list()</code> or <code>.()</code>. Therefore a vector is returned.</p></li> </ul> <p>We could have accomplished the same operation by doing <code>nrow(flights[origin == "JFK" & month == 6L])</code>. However, it would have to subset the entire <code>data.table</code> first corresponding to the <em>row indices</em> in <code>i</code> <em>and then</em> return the rows using <code>nrow()</code>, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the <em><code>data.table</code> design</em> vignette.</p> </div> </div> <div id="refer_j" class="section level3"> <h3>g) Great! But how can I refer to columns by names in <code>j</code> (like in a <code>data.frame</code>)?</h3> <p>If you’re writing out the column names explicitly, there’s no difference vis-a-vis <code>data.frame</code> (since v1.9.8).</p> <div id="select-both-arr_delay-and-dep_delay-columns-the-data.frame-way." class="section level4"> <h4>– Select both <code>arr_delay</code> and <code>dep_delay</code> columns the <code>data.frame</code> way.</h4> <div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, <span class="kw">c</span>(<span class="st">"arr_delay"</span>, <span class="st">"dep_delay"</span>)]</span> <span id="cb16-2"><a href="#cb16-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb16-3"><a href="#cb16-3" aria-hidden="true"></a><span class="co"># arr_delay dep_delay</span></span> <span id="cb16-4"><a href="#cb16-4" aria-hidden="true"></a><span class="co"># 1: 13 14</span></span> <span id="cb16-5"><a href="#cb16-5" aria-hidden="true"></a><span class="co"># 2: 13 -3</span></span> <span id="cb16-6"><a href="#cb16-6" aria-hidden="true"></a><span class="co"># 3: 9 2</span></span> <span id="cb16-7"><a href="#cb16-7" aria-hidden="true"></a><span class="co"># 4: -26 -8</span></span> <span id="cb16-8"><a href="#cb16-8" aria-hidden="true"></a><span class="co"># 5: 1 2</span></span> <span id="cb16-9"><a href="#cb16-9" aria-hidden="true"></a><span class="co"># 6: 0 4</span></span></code></pre></div> <p>If you’ve stored the desired columns in a character vector, there are two options: Using the <code>..</code> prefix, or using the <code>with</code> argument.</p> </div> <div id="select-columns-named-in-a-variable-using-the-..-prefix" class="section level4"> <h4>– Select columns named in a variable using the <code>..</code> prefix</h4> <div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true"></a>select_cols =<span class="st"> </span><span class="kw">c</span>(<span class="st">"arr_delay"</span>, <span class="st">"dep_delay"</span>)</span> <span id="cb17-2"><a href="#cb17-2" aria-hidden="true"></a>flights[ , ..select_cols]</span> <span id="cb17-3"><a href="#cb17-3" aria-hidden="true"></a><span class="co"># arr_delay dep_delay</span></span> <span id="cb17-4"><a href="#cb17-4" aria-hidden="true"></a><span class="co"># 1: 13 14</span></span> <span id="cb17-5"><a href="#cb17-5" aria-hidden="true"></a><span class="co"># 2: 13 -3</span></span> <span id="cb17-6"><a href="#cb17-6" aria-hidden="true"></a><span class="co"># 3: 9 2</span></span> <span id="cb17-7"><a href="#cb17-7" aria-hidden="true"></a><span class="co"># 4: -26 -8</span></span> <span id="cb17-8"><a href="#cb17-8" aria-hidden="true"></a><span class="co"># 5: 1 2</span></span> <span id="cb17-9"><a href="#cb17-9" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb17-10"><a href="#cb17-10" aria-hidden="true"></a><span class="co"># 253312: -30 1</span></span> <span id="cb17-11"><a href="#cb17-11" aria-hidden="true"></a><span class="co"># 253313: -14 -5</span></span> <span id="cb17-12"><a href="#cb17-12" aria-hidden="true"></a><span class="co"># 253314: 16 -8</span></span> <span id="cb17-13"><a href="#cb17-13" aria-hidden="true"></a><span class="co"># 253315: 15 -4</span></span> <span id="cb17-14"><a href="#cb17-14" aria-hidden="true"></a><span class="co"># 253316: 1 -5</span></span></code></pre></div> <p>For those familiar with the Unix terminal, the <code>..</code> prefix should be reminiscent of the “up-one-level” command, which is analogous to what’s happening here – the <code>..</code> signals to <code>data.table</code> to look for the <code>select_cols</code> variable “up-one-level”, i.e., in the global environment in this case.</p> </div> <div id="select-columns-named-in-a-variable-using-with-false" class="section level4"> <h4>– Select columns named in a variable using <code>with = FALSE</code></h4> <div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true"></a>flights[ , select_cols, with =<span class="st"> </span><span class="ot">FALSE</span>]</span> <span id="cb18-2"><a href="#cb18-2" aria-hidden="true"></a><span class="co"># arr_delay dep_delay</span></span> <span id="cb18-3"><a href="#cb18-3" aria-hidden="true"></a><span class="co"># 1: 13 14</span></span> <span id="cb18-4"><a href="#cb18-4" aria-hidden="true"></a><span class="co"># 2: 13 -3</span></span> <span id="cb18-5"><a href="#cb18-5" aria-hidden="true"></a><span class="co"># 3: 9 2</span></span> <span id="cb18-6"><a href="#cb18-6" aria-hidden="true"></a><span class="co"># 4: -26 -8</span></span> <span id="cb18-7"><a href="#cb18-7" aria-hidden="true"></a><span class="co"># 5: 1 2</span></span> <span id="cb18-8"><a href="#cb18-8" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb18-9"><a href="#cb18-9" aria-hidden="true"></a><span class="co"># 253312: -30 1</span></span> <span id="cb18-10"><a href="#cb18-10" aria-hidden="true"></a><span class="co"># 253313: -14 -5</span></span> <span id="cb18-11"><a href="#cb18-11" aria-hidden="true"></a><span class="co"># 253314: 16 -8</span></span> <span id="cb18-12"><a href="#cb18-12" aria-hidden="true"></a><span class="co"># 253315: 15 -4</span></span> <span id="cb18-13"><a href="#cb18-13" aria-hidden="true"></a><span class="co"># 253316: 1 -5</span></span></code></pre></div> <p>The argument is named <code>with</code> after the R function <code>with()</code> because of similar functionality. Suppose you have a <code>data.frame</code> <code>DF</code> and you’d like to subset all rows where <code>x > 1</code>. In <code>base</code> R you can do the following:</p> <div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true"></a>DF =<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">x =</span> <span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">1</span>,<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">2</span>,<span class="dv">3</span>,<span class="dv">3</span>,<span class="dv">3</span>), <span class="dt">y =</span> <span class="dv">1</span><span class="op">:</span><span class="dv">8</span>)</span> <span id="cb19-2"><a href="#cb19-2" aria-hidden="true"></a></span> <span id="cb19-3"><a href="#cb19-3" aria-hidden="true"></a><span class="co">## (1) normal way</span></span> <span id="cb19-4"><a href="#cb19-4" aria-hidden="true"></a>DF[DF<span class="op">$</span>x <span class="op">></span><span class="st"> </span><span class="dv">1</span>, ] <span class="co"># data.frame needs that ',' as well</span></span> <span id="cb19-5"><a href="#cb19-5" aria-hidden="true"></a><span class="co"># x y</span></span> <span id="cb19-6"><a href="#cb19-6" aria-hidden="true"></a><span class="co"># 4 2 4</span></span> <span id="cb19-7"><a href="#cb19-7" aria-hidden="true"></a><span class="co"># 5 2 5</span></span> <span id="cb19-8"><a href="#cb19-8" aria-hidden="true"></a><span class="co"># 6 3 6</span></span> <span id="cb19-9"><a href="#cb19-9" aria-hidden="true"></a><span class="co"># 7 3 7</span></span> <span id="cb19-10"><a href="#cb19-10" aria-hidden="true"></a><span class="co"># 8 3 8</span></span> <span id="cb19-11"><a href="#cb19-11" aria-hidden="true"></a></span> <span id="cb19-12"><a href="#cb19-12" aria-hidden="true"></a><span class="co">## (2) using with</span></span> <span id="cb19-13"><a href="#cb19-13" aria-hidden="true"></a>DF[<span class="kw">with</span>(DF, x <span class="op">></span><span class="st"> </span><span class="dv">1</span>), ]</span> <span id="cb19-14"><a href="#cb19-14" aria-hidden="true"></a><span class="co"># x y</span></span> <span id="cb19-15"><a href="#cb19-15" aria-hidden="true"></a><span class="co"># 4 2 4</span></span> <span id="cb19-16"><a href="#cb19-16" aria-hidden="true"></a><span class="co"># 5 2 5</span></span> <span id="cb19-17"><a href="#cb19-17" aria-hidden="true"></a><span class="co"># 6 3 6</span></span> <span id="cb19-18"><a href="#cb19-18" aria-hidden="true"></a><span class="co"># 7 3 7</span></span> <span id="cb19-19"><a href="#cb19-19" aria-hidden="true"></a><span class="co"># 8 3 8</span></span></code></pre></div> <ul> <li><p>Using <code>with()</code> in (2) allows using <code>DF</code>’s column <code>x</code> as if it were a variable.</p> <p>Hence the argument name <code>with</code> in <code>data.table</code>. Setting <code>with = FALSE</code> disables the ability to refer to columns as if they are variables, thereby restoring the “<code>data.frame</code> mode”.</p></li> <li><p>We can also <em>deselect</em> columns using <code>-</code> or <code>!</code>. For example:</p> <div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true"></a><span class="co">## not run</span></span> <span id="cb20-2"><a href="#cb20-2" aria-hidden="true"></a></span> <span id="cb20-3"><a href="#cb20-3" aria-hidden="true"></a><span class="co"># returns all columns except arr_delay and dep_delay</span></span> <span id="cb20-4"><a href="#cb20-4" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, <span class="op">!</span><span class="kw">c</span>(<span class="st">"arr_delay"</span>, <span class="st">"dep_delay"</span>)]</span> <span id="cb20-5"><a href="#cb20-5" aria-hidden="true"></a><span class="co"># or</span></span> <span id="cb20-6"><a href="#cb20-6" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, <span class="op">-</span><span class="kw">c</span>(<span class="st">"arr_delay"</span>, <span class="st">"dep_delay"</span>)]</span></code></pre></div></li> <li><p>From <code>v1.9.5+</code>, we can also select by specifying start and end column names, e.g., <code>year:day</code> to select the first three columns.</p> <div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true"></a><span class="co">## not run</span></span> <span id="cb21-2"><a href="#cb21-2" aria-hidden="true"></a></span> <span id="cb21-3"><a href="#cb21-3" aria-hidden="true"></a><span class="co"># returns year,month and day</span></span> <span id="cb21-4"><a href="#cb21-4" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, year<span class="op">:</span>day]</span> <span id="cb21-5"><a href="#cb21-5" aria-hidden="true"></a><span class="co"># returns day, month and year</span></span> <span id="cb21-6"><a href="#cb21-6" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, day<span class="op">:</span>year]</span> <span id="cb21-7"><a href="#cb21-7" aria-hidden="true"></a><span class="co"># returns all columns except year, month and day</span></span> <span id="cb21-8"><a href="#cb21-8" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, <span class="op">-</span>(year<span class="op">:</span>day)]</span> <span id="cb21-9"><a href="#cb21-9" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, <span class="op">!</span>(year<span class="op">:</span>day)]</span></code></pre></div> <p>This is particularly handy while working interactively.</p></li> </ul> <p><code>with = TRUE</code> is the default in <code>data.table</code> because we can do much more by allowing <code>j</code> to handle expressions - especially when combined with <code>by</code>, as we’ll see in a moment.</p> </div> </div> </div> <div id="aggregations" class="section level2"> <h2>2. Aggregations</h2> <p>We’ve already seen <code>i</code> and <code>j</code> from <code>data.table</code>’s general form in the previous section. In this section, we’ll see how they can be combined together with <code>by</code> to perform operations <em>by group</em>. Let’s look at some examples.</p> <div id="a-grouping-using-by" class="section level3"> <h3>a) Grouping using <code>by</code></h3> <div id="how-can-we-get-the-number-of-trips-corresponding-to-each-origin-airport" class="section level4"> <h4>– How can we get the number of trips corresponding to each origin airport?</h4> <div class="sourceCode" id="cb22"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, .(.N), by =<span class="st"> </span>.(origin)]</span> <span id="cb22-2"><a href="#cb22-2" aria-hidden="true"></a>ans</span> <span id="cb22-3"><a href="#cb22-3" aria-hidden="true"></a><span class="co"># origin N</span></span> <span id="cb22-4"><a href="#cb22-4" aria-hidden="true"></a><span class="co"># 1: JFK 81483</span></span> <span id="cb22-5"><a href="#cb22-5" aria-hidden="true"></a><span class="co"># 2: LGA 84433</span></span> <span id="cb22-6"><a href="#cb22-6" aria-hidden="true"></a><span class="co"># 3: EWR 87400</span></span> <span id="cb22-7"><a href="#cb22-7" aria-hidden="true"></a></span> <span id="cb22-8"><a href="#cb22-8" aria-hidden="true"></a><span class="co">## or equivalently using a character vector in 'by'</span></span> <span id="cb22-9"><a href="#cb22-9" aria-hidden="true"></a><span class="co"># ans <- flights[, .(.N), by = "origin"]</span></span></code></pre></div> </div> <div id="section-7" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>We know <code>.N</code> <a href="#special-N">is a special variable</a> that holds the number of rows in the current group. Grouping by <code>origin</code> obtains the number of rows, <code>.N</code>, for each group.</p></li> <li><p>By doing <code>head(flights)</code> you can see that the origin airports occur in the order <em>“JFK”</em>, <em>“LGA”</em> and <em>“EWR”</em>. The original order of grouping variables is preserved in the result. <em>This is important to keep in mind!</em></p></li> <li><p>Since we did not provide a name for the column returned in <code>j</code>, it was named <code>N</code> automatically by recognising the special symbol <code>.N</code>.</p></li> <li><p><code>by</code> also accepts a character vector of column names. This is particularly useful for coding programmatically, e.g., designing a function with the grouping columns as a (<code>character</code> vector) function argument.</p></li> <li><p>When there’s only one column or expression to refer to in <code>j</code> and <code>by</code>, we can drop the <code>.()</code> notation. This is purely for convenience. We could instead do:</p> <div class="sourceCode" id="cb23"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, .N, by =<span class="st"> </span>origin]</span> <span id="cb23-2"><a href="#cb23-2" aria-hidden="true"></a>ans</span> <span id="cb23-3"><a href="#cb23-3" aria-hidden="true"></a><span class="co"># origin N</span></span> <span id="cb23-4"><a href="#cb23-4" aria-hidden="true"></a><span class="co"># 1: JFK 81483</span></span> <span id="cb23-5"><a href="#cb23-5" aria-hidden="true"></a><span class="co"># 2: LGA 84433</span></span> <span id="cb23-6"><a href="#cb23-6" aria-hidden="true"></a><span class="co"># 3: EWR 87400</span></span></code></pre></div> <p>We’ll use this convenient form wherever applicable hereafter.</p></li> </ul> </div> <div id="origin-.N" class="section level4"> <h4>– How can we calculate the number of trips for each origin airport for carrier code <code>"AA"</code>?</h4> <p>The unique carrier code <code>"AA"</code> corresponds to <em>American Airlines Inc.</em></p> <div class="sourceCode" id="cb24"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[carrier <span class="op">==</span><span class="st"> "AA"</span>, .N, by =<span class="st"> </span>origin]</span> <span id="cb24-2"><a href="#cb24-2" aria-hidden="true"></a>ans</span> <span id="cb24-3"><a href="#cb24-3" aria-hidden="true"></a><span class="co"># origin N</span></span> <span id="cb24-4"><a href="#cb24-4" aria-hidden="true"></a><span class="co"># 1: JFK 11923</span></span> <span id="cb24-5"><a href="#cb24-5" aria-hidden="true"></a><span class="co"># 2: LGA 11730</span></span> <span id="cb24-6"><a href="#cb24-6" aria-hidden="true"></a><span class="co"># 3: EWR 2649</span></span></code></pre></div> </div> <div id="section-8" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>We first obtain the row indices for the expression <code>carrier == "AA"</code> from <code>i</code>.</p></li> <li><p>Using those <em>row indices</em>, we obtain the number of rows while grouped by <code>origin</code>. Once again no columns are actually materialised here, because the <code>j-expression</code> does not require any columns to be actually subsetted and is therefore fast and memory efficient.</p></li> </ul> </div> <div id="origin-dest-.N" class="section level4"> <h4>– How can we get the total number of trips for each <code>origin, dest</code> pair for carrier code <code>"AA"</code>?</h4> <div class="sourceCode" id="cb25"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[carrier <span class="op">==</span><span class="st"> "AA"</span>, .N, by =<span class="st"> </span>.(origin, dest)]</span> <span id="cb25-2"><a href="#cb25-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb25-3"><a href="#cb25-3" aria-hidden="true"></a><span class="co"># origin dest N</span></span> <span id="cb25-4"><a href="#cb25-4" aria-hidden="true"></a><span class="co"># 1: JFK LAX 3387</span></span> <span id="cb25-5"><a href="#cb25-5" aria-hidden="true"></a><span class="co"># 2: LGA PBI 245</span></span> <span id="cb25-6"><a href="#cb25-6" aria-hidden="true"></a><span class="co"># 3: EWR LAX 62</span></span> <span id="cb25-7"><a href="#cb25-7" aria-hidden="true"></a><span class="co"># 4: JFK MIA 1876</span></span> <span id="cb25-8"><a href="#cb25-8" aria-hidden="true"></a><span class="co"># 5: JFK SEA 298</span></span> <span id="cb25-9"><a href="#cb25-9" aria-hidden="true"></a><span class="co"># 6: EWR MIA 848</span></span> <span id="cb25-10"><a href="#cb25-10" aria-hidden="true"></a></span> <span id="cb25-11"><a href="#cb25-11" aria-hidden="true"></a><span class="co">## or equivalently using a character vector in 'by'</span></span> <span id="cb25-12"><a href="#cb25-12" aria-hidden="true"></a><span class="co"># ans <- flights[carrier == "AA", .N, by = c("origin", "dest")]</span></span></code></pre></div> </div> <div id="section-9" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><code>by</code> accepts multiple columns. We just provide all the columns by which to group by. Note the use of <code>.()</code> again in <code>by</code> – again, this is just shorthand for <code>list()</code>, and <code>list()</code> can be used here as well. Again, we’ll stick with <code>.()</code> in this vignette.</li> </ul> </div> <div id="origin-dest-month" class="section level4"> <h4>– How can we get the average arrival and departure delay for each <code>orig,dest</code> pair for each month for carrier code <code>"AA"</code>?</h4> <div class="sourceCode" id="cb26"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[carrier <span class="op">==</span><span class="st"> "AA"</span>,</span> <span id="cb26-2"><a href="#cb26-2" aria-hidden="true"></a> .(<span class="kw">mean</span>(arr_delay), <span class="kw">mean</span>(dep_delay)),</span> <span id="cb26-3"><a href="#cb26-3" aria-hidden="true"></a> by =<span class="st"> </span>.(origin, dest, month)]</span> <span id="cb26-4"><a href="#cb26-4" aria-hidden="true"></a>ans</span> <span id="cb26-5"><a href="#cb26-5" aria-hidden="true"></a><span class="co"># origin dest month V1 V2</span></span> <span id="cb26-6"><a href="#cb26-6" aria-hidden="true"></a><span class="co"># 1: JFK LAX 1 6.590361 14.2289157</span></span> <span id="cb26-7"><a href="#cb26-7" aria-hidden="true"></a><span class="co"># 2: LGA PBI 1 -7.758621 0.3103448</span></span> <span id="cb26-8"><a href="#cb26-8" aria-hidden="true"></a><span class="co"># 3: EWR LAX 1 1.366667 7.5000000</span></span> <span id="cb26-9"><a href="#cb26-9" aria-hidden="true"></a><span class="co"># 4: JFK MIA 1 15.720670 18.7430168</span></span> <span id="cb26-10"><a href="#cb26-10" aria-hidden="true"></a><span class="co"># 5: JFK SEA 1 14.357143 30.7500000</span></span> <span id="cb26-11"><a href="#cb26-11" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb26-12"><a href="#cb26-12" aria-hidden="true"></a><span class="co"># 196: LGA MIA 10 -6.251799 -1.4208633</span></span> <span id="cb26-13"><a href="#cb26-13" aria-hidden="true"></a><span class="co"># 197: JFK MIA 10 -1.880184 6.6774194</span></span> <span id="cb26-14"><a href="#cb26-14" aria-hidden="true"></a><span class="co"># 198: EWR PHX 10 -3.032258 -4.2903226</span></span> <span id="cb26-15"><a href="#cb26-15" aria-hidden="true"></a><span class="co"># 199: JFK MCO 10 -10.048387 -1.6129032</span></span> <span id="cb26-16"><a href="#cb26-16" aria-hidden="true"></a><span class="co"># 200: JFK DCA 10 16.483871 15.5161290</span></span></code></pre></div> </div> <div id="section-10" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>Since we did not provide column names for the expressions in <code>j</code>, they were automatically generated as <code>V1</code> and <code>V2</code>.</p></li> <li><p>Once again, note that the input order of grouping columns is preserved in the result.</p></li> </ul> <p>Now what if we would like to order the result by those grouping columns <code>origin</code>, <code>dest</code> and <code>month</code>?</p> </div> </div> <div id="b-sorted-by-keyby" class="section level3"> <h3>b) Sorted <code>by</code>: <code>keyby</code></h3> <p><code>data.table</code> retaining the original order of groups is intentional and by design. There are cases when preserving the original order is essential. But at times we would like to automatically sort by the variables in our grouping.</p> <div id="so-how-can-we-directly-order-by-all-the-grouping-variables" class="section level4"> <h4>– So how can we directly order by all the grouping variables?</h4> <div class="sourceCode" id="cb27"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[carrier <span class="op">==</span><span class="st"> "AA"</span>,</span> <span id="cb27-2"><a href="#cb27-2" aria-hidden="true"></a> .(<span class="kw">mean</span>(arr_delay), <span class="kw">mean</span>(dep_delay)),</span> <span id="cb27-3"><a href="#cb27-3" aria-hidden="true"></a> keyby =<span class="st"> </span>.(origin, dest, month)]</span> <span id="cb27-4"><a href="#cb27-4" aria-hidden="true"></a>ans</span> <span id="cb27-5"><a href="#cb27-5" aria-hidden="true"></a><span class="co"># origin dest month V1 V2</span></span> <span id="cb27-6"><a href="#cb27-6" aria-hidden="true"></a><span class="co"># 1: EWR DFW 1 6.427673 10.0125786</span></span> <span id="cb27-7"><a href="#cb27-7" aria-hidden="true"></a><span class="co"># 2: EWR DFW 2 10.536765 11.3455882</span></span> <span id="cb27-8"><a href="#cb27-8" aria-hidden="true"></a><span class="co"># 3: EWR DFW 3 12.865031 8.0797546</span></span> <span id="cb27-9"><a href="#cb27-9" aria-hidden="true"></a><span class="co"># 4: EWR DFW 4 17.792683 12.9207317</span></span> <span id="cb27-10"><a href="#cb27-10" aria-hidden="true"></a><span class="co"># 5: EWR DFW 5 18.487805 18.6829268</span></span> <span id="cb27-11"><a href="#cb27-11" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb27-12"><a href="#cb27-12" aria-hidden="true"></a><span class="co"># 196: LGA PBI 1 -7.758621 0.3103448</span></span> <span id="cb27-13"><a href="#cb27-13" aria-hidden="true"></a><span class="co"># 197: LGA PBI 2 -7.865385 2.4038462</span></span> <span id="cb27-14"><a href="#cb27-14" aria-hidden="true"></a><span class="co"># 198: LGA PBI 3 -5.754098 3.0327869</span></span> <span id="cb27-15"><a href="#cb27-15" aria-hidden="true"></a><span class="co"># 199: LGA PBI 4 -13.966667 -4.7333333</span></span> <span id="cb27-16"><a href="#cb27-16" aria-hidden="true"></a><span class="co"># 200: LGA PBI 5 -10.357143 -6.8571429</span></span></code></pre></div> </div> <div id="section-11" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li>All we did was to change <code>by</code> to <code>keyby</code>. This automatically orders the result by the grouping variables in increasing order. In fact, due to the internal implementation of <code>by</code> first requiring a sort before recovering the original table’s order, <code>keyby</code> is typically faster than <code>by</code> because it doesn’t require this second step.</li> </ul> <p><strong>Keys:</strong> Actually <code>keyby</code> does a little more than <em>just ordering</em>. It also <em>sets a key</em> after ordering by setting an <code>attribute</code> called <code>sorted</code>.</p> <p>We’ll learn more about <code>keys</code> in the <em>Keys and fast binary search based subset</em> vignette; for now, all you have to know is that you can use <code>keyby</code> to automatically order the result by the columns specified in <code>by</code>.</p> </div> </div> <div id="c-chaining" class="section level3"> <h3>c) Chaining</h3> <p>Let’s reconsider the task of <a href="#origin-dest-.N">getting the total number of trips for each <code>origin, dest</code> pair for carrier <em>“AA”</em></a>.</p> <div class="sourceCode" id="cb28"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb28-1"><a href="#cb28-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[carrier <span class="op">==</span><span class="st"> "AA"</span>, .N, by =<span class="st"> </span>.(origin, dest)]</span></code></pre></div> <div id="how-can-we-order-ans-using-the-columns-origin-in-ascending-order-and-dest-in-descending-order" class="section level4"> <h4>– How can we order <code>ans</code> using the columns <code>origin</code> in ascending order, and <code>dest</code> in descending order?</h4> <p>We can store the intermediate result in a variable, and then use <code>order(origin, -dest)</code> on that variable. It seems fairly straightforward.</p> <div class="sourceCode" id="cb29"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true"></a>ans <-<span class="st"> </span>ans[<span class="kw">order</span>(origin, <span class="op">-</span>dest)]</span> <span id="cb29-2"><a href="#cb29-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb29-3"><a href="#cb29-3" aria-hidden="true"></a><span class="co"># origin dest N</span></span> <span id="cb29-4"><a href="#cb29-4" aria-hidden="true"></a><span class="co"># 1: EWR PHX 121</span></span> <span id="cb29-5"><a href="#cb29-5" aria-hidden="true"></a><span class="co"># 2: EWR MIA 848</span></span> <span id="cb29-6"><a href="#cb29-6" aria-hidden="true"></a><span class="co"># 3: EWR LAX 62</span></span> <span id="cb29-7"><a href="#cb29-7" aria-hidden="true"></a><span class="co"># 4: EWR DFW 1618</span></span> <span id="cb29-8"><a href="#cb29-8" aria-hidden="true"></a><span class="co"># 5: JFK STT 229</span></span> <span id="cb29-9"><a href="#cb29-9" aria-hidden="true"></a><span class="co"># 6: JFK SJU 690</span></span></code></pre></div> </div> <div id="section-12" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>Recall that we can use <code>-</code> on a <code>character</code> column in <code>order()</code> within the frame of a <code>data.table</code>. This is possible to due <code>data.table</code>’s internal query optimisation.</p></li> <li><p>Also recall that <code>order(...)</code> with the frame of a <code>data.table</code> is <em>automatically optimised</em> to use <code>data.table</code>’s internal fast radix order <code>forder()</code> for speed.</p></li> </ul> <p>But this requires having to assign the intermediate result and then overwriting that result. We can do one better and avoid this intermediate assignment to a temporary variable altogether by <em>chaining</em> expressions.</p> <div class="sourceCode" id="cb30"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb30-1"><a href="#cb30-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[carrier <span class="op">==</span><span class="st"> "AA"</span>, .N, by =<span class="st"> </span>.(origin, dest)][<span class="kw">order</span>(origin, <span class="op">-</span>dest)]</span> <span id="cb30-2"><a href="#cb30-2" aria-hidden="true"></a><span class="kw">head</span>(ans, <span class="dv">10</span>)</span> <span id="cb30-3"><a href="#cb30-3" aria-hidden="true"></a><span class="co"># origin dest N</span></span> <span id="cb30-4"><a href="#cb30-4" aria-hidden="true"></a><span class="co"># 1: EWR PHX 121</span></span> <span id="cb30-5"><a href="#cb30-5" aria-hidden="true"></a><span class="co"># 2: EWR MIA 848</span></span> <span id="cb30-6"><a href="#cb30-6" aria-hidden="true"></a><span class="co"># 3: EWR LAX 62</span></span> <span id="cb30-7"><a href="#cb30-7" aria-hidden="true"></a><span class="co"># 4: EWR DFW 1618</span></span> <span id="cb30-8"><a href="#cb30-8" aria-hidden="true"></a><span class="co"># 5: JFK STT 229</span></span> <span id="cb30-9"><a href="#cb30-9" aria-hidden="true"></a><span class="co"># 6: JFK SJU 690</span></span> <span id="cb30-10"><a href="#cb30-10" aria-hidden="true"></a><span class="co"># 7: JFK SFO 1312</span></span> <span id="cb30-11"><a href="#cb30-11" aria-hidden="true"></a><span class="co"># 8: JFK SEA 298</span></span> <span id="cb30-12"><a href="#cb30-12" aria-hidden="true"></a><span class="co"># 9: JFK SAN 299</span></span> <span id="cb30-13"><a href="#cb30-13" aria-hidden="true"></a><span class="co"># 10: JFK ORD 432</span></span></code></pre></div> </div> <div id="section-13" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>We can tack expressions one after another, <em>forming a chain</em> of operations, i.e., <code>DT[ ... ][ ... ][ ... ]</code>.</p></li> <li><p>Or you can also chain them vertically:</p> <div class="sourceCode" id="cb31"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb31-1"><a href="#cb31-1" aria-hidden="true"></a>DT[ ...</span> <span id="cb31-2"><a href="#cb31-2" aria-hidden="true"></a> ][ ...</span> <span id="cb31-3"><a href="#cb31-3" aria-hidden="true"></a> ][ ...</span> <span id="cb31-4"><a href="#cb31-4" aria-hidden="true"></a> ]</span></code></pre></div></li> </ul> </div> </div> <div id="d-expressions-in-by" class="section level3"> <h3>d) Expressions in <code>by</code></h3> <div id="can-by-accept-expressions-as-well-or-does-it-just-take-columns" class="section level4"> <h4>– Can <code>by</code> accept <em>expressions</em> as well or does it just take columns?</h4> <p>Yes it does. As an example, if we would like to find out how many flights started late but arrived early (or on time), started and arrived late etc…</p> <div class="sourceCode" id="cb32"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb32-1"><a href="#cb32-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, .N, .(dep_delay<span class="op">></span><span class="dv">0</span>, arr_delay<span class="op">></span><span class="dv">0</span>)]</span> <span id="cb32-2"><a href="#cb32-2" aria-hidden="true"></a>ans</span> <span id="cb32-3"><a href="#cb32-3" aria-hidden="true"></a><span class="co"># dep_delay arr_delay N</span></span> <span id="cb32-4"><a href="#cb32-4" aria-hidden="true"></a><span class="co"># 1: TRUE TRUE 72836</span></span> <span id="cb32-5"><a href="#cb32-5" aria-hidden="true"></a><span class="co"># 2: FALSE TRUE 34583</span></span> <span id="cb32-6"><a href="#cb32-6" aria-hidden="true"></a><span class="co"># 3: FALSE FALSE 119304</span></span> <span id="cb32-7"><a href="#cb32-7" aria-hidden="true"></a><span class="co"># 4: TRUE FALSE 26593</span></span></code></pre></div> </div> <div id="section-14" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>The last row corresponds to <code>dep_delay > 0 = TRUE</code> and <code>arr_delay > 0 = FALSE</code>. We can see that 26593 flights started late but arrived early (or on time).</p></li> <li><p>Note that we did not provide any names to <code>by-expression</code>. Therefore, names have been automatically assigned in the result. As with <code>j</code>, you can name these expressions as you would elements of any <code>list</code>, e.g. <code>DT[, .N, .(dep_delayed = dep_delay>0, arr_delayed = arr_delay>0)]</code>.</p></li> <li><p>You can provide other columns along with expressions, for example: <code>DT[, .N, by = .(a, b>0)]</code>.</p></li> </ul> </div> </div> <div id="e-multiple-columns-in-j---.sd" class="section level3"> <h3>e) Multiple columns in <code>j</code> - <code>.SD</code></h3> <div id="do-we-have-to-compute-mean-for-each-column-individually" class="section level4"> <h4>– Do we have to compute <code>mean()</code> for each column individually?</h4> <p>It is of course not practical to have to type <code>mean(myCol)</code> for every column one by one. What if you had 100 columns to average <code>mean()</code>?</p> <p>How can we do this efficiently, concisely? To get there, refresh on <a href="#tip-1">this tip</a> - <em>“As long as the <code>j</code>-expression returns a <code>list</code>, each element of the <code>list</code> will be converted to a column in the resulting <code>data.table</code>”</em>. Suppose we can refer to the <em>data subset for each group</em> as a variable <em>while grouping</em>, then we can loop through all the columns of that variable using the already- or soon-to-be-familiar base function <code>lapply()</code>. No new names to learn specific to <code>data.table</code>.</p> </div> <div id="special-SD" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">Special symbol <code>.SD</code>:</h4> <p><code>data.table</code> provides a <em>special</em> symbol, called <code>.SD</code>. It stands for <strong>S</strong>ubset of <strong>D</strong>ata. It by itself is a <code>data.table</code> that holds the data for <em>the current group</em> defined using <code>by</code>.</p> <p>Recall that a <code>data.table</code> is internally a <code>list</code> as well with all its columns of equal length.</p> <p>Let’s use the <a href="#what-is-datatable-1a"><code>data.table</code> <code>DT</code> from before</a> to get a glimpse of what <code>.SD</code> looks like.</p> <div class="sourceCode" id="cb33"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true"></a>DT</span> <span id="cb33-2"><a href="#cb33-2" aria-hidden="true"></a><span class="co"># ID a b c</span></span> <span id="cb33-3"><a href="#cb33-3" aria-hidden="true"></a><span class="co"># 1: b 1 7 13</span></span> <span id="cb33-4"><a href="#cb33-4" aria-hidden="true"></a><span class="co"># 2: b 2 8 14</span></span> <span id="cb33-5"><a href="#cb33-5" aria-hidden="true"></a><span class="co"># 3: b 3 9 15</span></span> <span id="cb33-6"><a href="#cb33-6" aria-hidden="true"></a><span class="co"># 4: a 4 10 16</span></span> <span id="cb33-7"><a href="#cb33-7" aria-hidden="true"></a><span class="co"># 5: a 5 11 17</span></span> <span id="cb33-8"><a href="#cb33-8" aria-hidden="true"></a><span class="co"># 6: c 6 12 18</span></span> <span id="cb33-9"><a href="#cb33-9" aria-hidden="true"></a></span> <span id="cb33-10"><a href="#cb33-10" aria-hidden="true"></a>DT[, <span class="kw">print</span>(.SD), by =<span class="st"> </span>ID]</span> <span id="cb33-11"><a href="#cb33-11" aria-hidden="true"></a><span class="co"># a b c</span></span> <span id="cb33-12"><a href="#cb33-12" aria-hidden="true"></a><span class="co"># 1: 1 7 13</span></span> <span id="cb33-13"><a href="#cb33-13" aria-hidden="true"></a><span class="co"># 2: 2 8 14</span></span> <span id="cb33-14"><a href="#cb33-14" aria-hidden="true"></a><span class="co"># 3: 3 9 15</span></span> <span id="cb33-15"><a href="#cb33-15" aria-hidden="true"></a><span class="co"># a b c</span></span> <span id="cb33-16"><a href="#cb33-16" aria-hidden="true"></a><span class="co"># 1: 4 10 16</span></span> <span id="cb33-17"><a href="#cb33-17" aria-hidden="true"></a><span class="co"># 2: 5 11 17</span></span> <span id="cb33-18"><a href="#cb33-18" aria-hidden="true"></a><span class="co"># a b c</span></span> <span id="cb33-19"><a href="#cb33-19" aria-hidden="true"></a><span class="co"># 1: 6 12 18</span></span> <span id="cb33-20"><a href="#cb33-20" aria-hidden="true"></a><span class="co"># Empty data.table (0 rows and 1 cols): ID</span></span></code></pre></div> </div> <div id="section-15" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p><code>.SD</code> contains all the columns <em>except the grouping columns</em> by default.</p></li> <li><p>It is also generated by preserving the original order - data corresponding to <code>ID = "b"</code>, then <code>ID = "a"</code>, and then <code>ID = "c"</code>.</p></li> </ul> <p>To compute on (multiple) columns, we can then simply use the base R function <code>lapply()</code>.</p> <div class="sourceCode" id="cb34"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb34-1"><a href="#cb34-1" aria-hidden="true"></a>DT[, <span class="kw">lapply</span>(.SD, mean), by =<span class="st"> </span>ID]</span> <span id="cb34-2"><a href="#cb34-2" aria-hidden="true"></a><span class="co"># ID a b c</span></span> <span id="cb34-3"><a href="#cb34-3" aria-hidden="true"></a><span class="co"># 1: b 2.0 8.0 14.0</span></span> <span id="cb34-4"><a href="#cb34-4" aria-hidden="true"></a><span class="co"># 2: a 4.5 10.5 16.5</span></span> <span id="cb34-5"><a href="#cb34-5" aria-hidden="true"></a><span class="co"># 3: c 6.0 12.0 18.0</span></span></code></pre></div> </div> <div id="section-16" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p><code>.SD</code> holds the rows corresponding to columns <code>a</code>, <code>b</code> and <code>c</code> for that group. We compute the <code>mean()</code> on each of these columns using the already-familiar base function <code>lapply()</code>.</p></li> <li><p>Each group returns a list of three elements containing the mean value which will become the columns of the resulting <code>data.table</code>.</p></li> <li><p>Since <code>lapply()</code> returns a <code>list</code>, so there is no need to wrap it with an additional <code>.()</code> (if necessary, refer to <a href="#tip-1">this tip</a>).</p></li> </ul> <p>We are almost there. There is one little thing left to address. In our <code>flights</code> <code>data.table</code>, we only wanted to calculate the <code>mean()</code> of two columns <code>arr_delay</code> and <code>dep_delay</code>. But <code>.SD</code> would contain all the columns other than the grouping variables by default.</p> </div> <div id="how-can-we-specify-just-the-columns-we-would-like-to-compute-the-mean-on" class="section level4"> <h4>– How can we specify just the columns we would like to compute the <code>mean()</code> on?</h4> </div> <div id="sdcols" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">.SDcols</h4> <p>Using the argument <code>.SDcols</code>. It accepts either column names or column indices. For example, <code>.SDcols = c("arr_delay", "dep_delay")</code> ensures that <code>.SD</code> contains only these two columns for each group.</p> <p>Similar to <a href="#refer_j">part g)</a>, you can also provide the columns to remove instead of columns to keep using <code>-</code> or <code>!</code> sign as well as select consecutive columns as <code>colA:colB</code> and deselect consecutive columns as <code>!(colA:colB)</code> or <code>-(colA:colB)</code>.</p> <p>Now let us try to use <code>.SD</code> along with <code>.SDcols</code> to get the <code>mean()</code> of <code>arr_delay</code> and <code>dep_delay</code> columns grouped by <code>origin</code>, <code>dest</code> and <code>month</code>.</p> <div class="sourceCode" id="cb35"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb35-1"><a href="#cb35-1" aria-hidden="true"></a>flights[carrier <span class="op">==</span><span class="st"> "AA"</span>, <span class="co">## Only on trips with carrier "AA"</span></span> <span id="cb35-2"><a href="#cb35-2" aria-hidden="true"></a> <span class="kw">lapply</span>(.SD, mean), <span class="co">## compute the mean</span></span> <span id="cb35-3"><a href="#cb35-3" aria-hidden="true"></a> by =<span class="st"> </span>.(origin, dest, month), <span class="co">## for every 'origin,dest,month'</span></span> <span id="cb35-4"><a href="#cb35-4" aria-hidden="true"></a> .SDcols =<span class="st"> </span><span class="kw">c</span>(<span class="st">"arr_delay"</span>, <span class="st">"dep_delay"</span>)] <span class="co">## for just those specified in .SDcols</span></span> <span id="cb35-5"><a href="#cb35-5" aria-hidden="true"></a><span class="co"># origin dest month arr_delay dep_delay</span></span> <span id="cb35-6"><a href="#cb35-6" aria-hidden="true"></a><span class="co"># 1: JFK LAX 1 6.590361 14.2289157</span></span> <span id="cb35-7"><a href="#cb35-7" aria-hidden="true"></a><span class="co"># 2: LGA PBI 1 -7.758621 0.3103448</span></span> <span id="cb35-8"><a href="#cb35-8" aria-hidden="true"></a><span class="co"># 3: EWR LAX 1 1.366667 7.5000000</span></span> <span id="cb35-9"><a href="#cb35-9" aria-hidden="true"></a><span class="co"># 4: JFK MIA 1 15.720670 18.7430168</span></span> <span id="cb35-10"><a href="#cb35-10" aria-hidden="true"></a><span class="co"># 5: JFK SEA 1 14.357143 30.7500000</span></span> <span id="cb35-11"><a href="#cb35-11" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb35-12"><a href="#cb35-12" aria-hidden="true"></a><span class="co"># 196: LGA MIA 10 -6.251799 -1.4208633</span></span> <span id="cb35-13"><a href="#cb35-13" aria-hidden="true"></a><span class="co"># 197: JFK MIA 10 -1.880184 6.6774194</span></span> <span id="cb35-14"><a href="#cb35-14" aria-hidden="true"></a><span class="co"># 198: EWR PHX 10 -3.032258 -4.2903226</span></span> <span id="cb35-15"><a href="#cb35-15" aria-hidden="true"></a><span class="co"># 199: JFK MCO 10 -10.048387 -1.6129032</span></span> <span id="cb35-16"><a href="#cb35-16" aria-hidden="true"></a><span class="co"># 200: JFK DCA 10 16.483871 15.5161290</span></span></code></pre></div> </div> </div> <div id="f-subset-.sd-for-each-group" class="section level3"> <h3>f) Subset <code>.SD</code> for each group:</h3> <div id="how-can-we-return-the-first-two-rows-for-each-month" class="section level4"> <h4>– How can we return the first two rows for each <code>month</code>?</h4> <div class="sourceCode" id="cb36"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb36-1"><a href="#cb36-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[, <span class="kw">head</span>(.SD, <span class="dv">2</span>), by =<span class="st"> </span>month]</span> <span id="cb36-2"><a href="#cb36-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb36-3"><a href="#cb36-3" aria-hidden="true"></a><span class="co"># month year day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb36-4"><a href="#cb36-4" aria-hidden="true"></a><span class="co"># 1: 1 2014 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb36-5"><a href="#cb36-5" aria-hidden="true"></a><span class="co"># 2: 1 2014 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb36-6"><a href="#cb36-6" aria-hidden="true"></a><span class="co"># 3: 2 2014 1 -1 1 AA JFK LAX 358 2475 8</span></span> <span id="cb36-7"><a href="#cb36-7" aria-hidden="true"></a><span class="co"># 4: 2 2014 1 -5 3 AA JFK LAX 358 2475 11</span></span> <span id="cb36-8"><a href="#cb36-8" aria-hidden="true"></a><span class="co"># 5: 3 2014 1 -11 36 AA JFK LAX 375 2475 8</span></span> <span id="cb36-9"><a href="#cb36-9" aria-hidden="true"></a><span class="co"># 6: 3 2014 1 -3 14 AA JFK LAX 368 2475 11</span></span></code></pre></div> </div> <div id="section-17" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p><code>.SD</code> is a <code>data.table</code> that holds all the rows for <em>that group</em>. We simply subset the first two rows as we have seen <a href="#subset-rows-integer">here</a> already.</p></li> <li><p>For each group, <code>head(.SD, 2)</code> returns the first two rows as a <code>data.table</code>, which is also a <code>list</code>, so we do not have to wrap it with <code>.()</code>.</p></li> </ul> </div> </div> <div id="g-why-keep-j-so-flexible" class="section level3"> <h3>g) Why keep <code>j</code> so flexible?</h3> <p>So that we have a consistent syntax and keep using already existing (and familiar) base functions instead of learning new functions. To illustrate, let us use the <code>data.table</code> <code>DT</code> that we created at the very beginning under <a href="#what-is-datatable-1a">What is a data.table?</a> section.</p> <div id="how-can-we-concatenate-columns-a-and-b-for-each-group-in-id" class="section level4"> <h4>– How can we concatenate columns <code>a</code> and <code>b</code> for each group in <code>ID</code>?</h4> <div class="sourceCode" id="cb37"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb37-1"><a href="#cb37-1" aria-hidden="true"></a>DT[, .(<span class="dt">val =</span> <span class="kw">c</span>(a,b)), by =<span class="st"> </span>ID]</span> <span id="cb37-2"><a href="#cb37-2" aria-hidden="true"></a><span class="co"># ID val</span></span> <span id="cb37-3"><a href="#cb37-3" aria-hidden="true"></a><span class="co"># 1: b 1</span></span> <span id="cb37-4"><a href="#cb37-4" aria-hidden="true"></a><span class="co"># 2: b 2</span></span> <span id="cb37-5"><a href="#cb37-5" aria-hidden="true"></a><span class="co"># 3: b 3</span></span> <span id="cb37-6"><a href="#cb37-6" aria-hidden="true"></a><span class="co"># 4: b 7</span></span> <span id="cb37-7"><a href="#cb37-7" aria-hidden="true"></a><span class="co"># 5: b 8</span></span> <span id="cb37-8"><a href="#cb37-8" aria-hidden="true"></a><span class="co"># 6: b 9</span></span> <span id="cb37-9"><a href="#cb37-9" aria-hidden="true"></a><span class="co"># 7: a 4</span></span> <span id="cb37-10"><a href="#cb37-10" aria-hidden="true"></a><span class="co"># 8: a 5</span></span> <span id="cb37-11"><a href="#cb37-11" aria-hidden="true"></a><span class="co"># 9: a 10</span></span> <span id="cb37-12"><a href="#cb37-12" aria-hidden="true"></a><span class="co"># 10: a 11</span></span> <span id="cb37-13"><a href="#cb37-13" aria-hidden="true"></a><span class="co"># 11: c 6</span></span> <span id="cb37-14"><a href="#cb37-14" aria-hidden="true"></a><span class="co"># 12: c 12</span></span></code></pre></div> </div> <div id="section-18" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li>That’s it. There is no special syntax required. All we need to know is the base function <code>c()</code> which concatenates vectors and <a href="#tip-1">the tip from before</a>.</li> </ul> </div> <div id="what-if-we-would-like-to-have-all-the-values-of-column-a-and-b-concatenated-but-returned-as-a-list-column" class="section level4"> <h4>– What if we would like to have all the values of column <code>a</code> and <code>b</code> concatenated, but returned as a list column?</h4> <div class="sourceCode" id="cb38"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb38-1"><a href="#cb38-1" aria-hidden="true"></a>DT[, .(<span class="dt">val =</span> <span class="kw">list</span>(<span class="kw">c</span>(a,b))), by =<span class="st"> </span>ID]</span> <span id="cb38-2"><a href="#cb38-2" aria-hidden="true"></a><span class="co"># ID val</span></span> <span id="cb38-3"><a href="#cb38-3" aria-hidden="true"></a><span class="co"># 1: b 1,2,3,7,8,9</span></span> <span id="cb38-4"><a href="#cb38-4" aria-hidden="true"></a><span class="co"># 2: a 4, 5,10,11</span></span> <span id="cb38-5"><a href="#cb38-5" aria-hidden="true"></a><span class="co"># 3: c 6,12</span></span></code></pre></div> </div> <div id="section-19" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"></h4> <ul> <li><p>Here, we first concatenate the values with <code>c(a,b)</code> for each group, and wrap that with <code>list()</code>. So for each group, we return a list of all concatenated values.</p></li> <li><p>Note those commas are for display only. A list column can contain any object in each cell, and in this example, each cell is itself a vector and some cells contain longer vectors than others.</p></li> </ul> <p>Once you start internalising usage in <code>j</code>, you will realise how powerful the syntax can be. A very useful way to understand it is by playing around, with the help of <code>print()</code>.</p> <p>For example:</p> <div class="sourceCode" id="cb39"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb39-1"><a href="#cb39-1" aria-hidden="true"></a><span class="co">## (1) look at the difference between</span></span> <span id="cb39-2"><a href="#cb39-2" aria-hidden="true"></a>DT[, <span class="kw">print</span>(<span class="kw">c</span>(a,b)), by =<span class="st"> </span>ID]</span> <span id="cb39-3"><a href="#cb39-3" aria-hidden="true"></a><span class="co"># [1] 1 2 3 7 8 9</span></span> <span id="cb39-4"><a href="#cb39-4" aria-hidden="true"></a><span class="co"># [1] 4 5 10 11</span></span> <span id="cb39-5"><a href="#cb39-5" aria-hidden="true"></a><span class="co"># [1] 6 12</span></span> <span id="cb39-6"><a href="#cb39-6" aria-hidden="true"></a><span class="co"># Empty data.table (0 rows and 1 cols): ID</span></span> <span id="cb39-7"><a href="#cb39-7" aria-hidden="true"></a></span> <span id="cb39-8"><a href="#cb39-8" aria-hidden="true"></a><span class="co">## (2) and</span></span> <span id="cb39-9"><a href="#cb39-9" aria-hidden="true"></a>DT[, <span class="kw">print</span>(<span class="kw">list</span>(<span class="kw">c</span>(a,b))), by =<span class="st"> </span>ID]</span> <span id="cb39-10"><a href="#cb39-10" aria-hidden="true"></a><span class="co"># [[1]]</span></span> <span id="cb39-11"><a href="#cb39-11" aria-hidden="true"></a><span class="co"># [1] 1 2 3 7 8 9</span></span> <span id="cb39-12"><a href="#cb39-12" aria-hidden="true"></a><span class="co"># </span></span> <span id="cb39-13"><a href="#cb39-13" aria-hidden="true"></a><span class="co"># [[1]]</span></span> <span id="cb39-14"><a href="#cb39-14" aria-hidden="true"></a><span class="co"># [1] 4 5 10 11</span></span> <span id="cb39-15"><a href="#cb39-15" aria-hidden="true"></a><span class="co"># </span></span> <span id="cb39-16"><a href="#cb39-16" aria-hidden="true"></a><span class="co"># [[1]]</span></span> <span id="cb39-17"><a href="#cb39-17" aria-hidden="true"></a><span class="co"># [1] 6 12</span></span> <span id="cb39-18"><a href="#cb39-18" aria-hidden="true"></a><span class="co"># Empty data.table (0 rows and 1 cols): ID</span></span></code></pre></div> <p>In (1), for each group, a vector is returned, with length = 6,4,2 here. However (2) returns a list of length 1 for each group, with its first element holding vectors of length 6,4,2. Therefore (1) results in a length of <code>6+4+2 = 12</code>, whereas (2) returns <code>1+1+1=3</code>.</p> </div> </div> </div> <div id="summary" class="section level2"> <h2>Summary</h2> <p>The general form of <code>data.table</code> syntax is:</p> <div class="sourceCode" id="cb40"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb40-1"><a href="#cb40-1" aria-hidden="true"></a>DT[i, j, by]</span></code></pre></div> <p>We have seen so far that,</p> <div id="using-i" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">Using <code>i</code>:</h4> <ul> <li><p>We can subset rows similar to a <code>data.frame</code>- except you don’t have to use <code>DT$</code> repetitively since columns within the frame of a <code>data.table</code> are seen as if they are <em>variables</em>.</p></li> <li><p>We can also sort a <code>data.table</code> using <code>order()</code>, which internally uses <code>data.table</code>’s fast order for performance.</p></li> </ul> <p>We can do much more in <code>i</code> by keying a <code>data.table</code>, which allows blazing fast subsets and joins. We will see this in the <em>“Keys and fast binary search based subsets”</em> and <em>“Joins and rolling joins”</em> vignette.</p> </div> <div id="using-j" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">Using <code>j</code>:</h4> <ol style="list-style-type: decimal"> <li><p>Select columns the <code>data.table</code> way: <code>DT[, .(colA, colB)]</code>.</p></li> <li><p>Select columns the <code>data.frame</code> way: <code>DT[, c("colA", "colB")]</code>.</p></li> <li><p>Compute on columns: <code>DT[, .(sum(colA), mean(colB))]</code>.</p></li> <li><p>Provide names if necessary: <code>DT[, .(sA =sum(colA), mB = mean(colB))]</code>.</p></li> <li><p>Combine with <code>i</code>: <code>DT[colA > value, sum(colB)]</code>.</p></li> </ol> </div> <div id="using-by" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info">Using <code>by</code>:</h4> <ul> <li><p>Using <code>by</code>, we can group by columns by specifying a <em>list of columns</em> or a <em>character vector of column names</em> or even <em>expressions</em>. The flexibility of <code>j</code>, combined with <code>by</code> and <code>i</code> makes for a very powerful syntax.</p></li> <li><p><code>by</code> can handle multiple columns and also <em>expressions</em>.</p></li> <li><p>We can <code>keyby</code> grouping columns to automatically sort the grouped result.</p></li> <li><p>We can use <code>.SD</code> and <code>.SDcols</code> in <code>j</code> to operate on multiple columns using already familiar base functions. Here are some examples:</p> <ol style="list-style-type: decimal"> <li><p><code>DT[, lapply(.SD, fun), by = ..., .SDcols = ...]</code> - applies <code>fun</code> to all columns specified in <code>.SDcols</code> while grouping by the columns specified in <code>by</code>.</p></li> <li><p><code>DT[, head(.SD, 2), by = ...]</code> - return the first two rows for each group.</p></li> <li><p><code>DT[col > val, head(.SD, 1), by = ...]</code> - combine <code>i</code> along with <code>j</code> and <code>by</code>.</p></li> </ol></li> </ul> </div> <div id="and-remember-the-tip" class="section level4 bs-callout bs-callout-warning"> <h4 class="bs-callout bs-callout-warning">And remember the tip:</h4> <p>As long as <code>j</code> returns a <code>list</code>, each element of the list will become a column in the resulting <code>data.table</code>.</p> <p>We will see how to <em>add/update/delete</em> columns <em>by reference</em> and how to combine them with <code>i</code> and <code>by</code> in the next vignette.</p> <hr /> </div> </div> <!-- code folding --> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>