EVOLUTION-MANAGER
Edit File: datatable-secondary-indices-and-auto-indexing.html
<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <meta name="date" content="2022-10-09" /> <title>Secondary indices and auto indexing</title> <script>// Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); </script> <script>// Hide empty <a> tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); </script> <style type="text/css"> code{white-space: pre-wrap;} span.smallcaps{font-variant: small-caps;} span.underline{text-decoration: underline;} div.column{display: inline-block; vertical-align: top; width: 50%;} div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} ul.task-list{list-style: none;} </style> <style type="text/css"> code { white-space: pre; } .sourceCode { overflow: visible; } </style> <style type="text/css" data-origin="pandoc"> pre > code.sourceCode { white-space: pre; position: relative; } pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } code.sourceCode > span { color: inherit; text-decoration: inherit; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { pre > code.sourceCode { white-space: pre-wrap; } pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } var j = 0; while (j < rules.length) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") { j++; continue; } var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' && rule.style.backgroundColor === '') { j++; continue; } // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> <style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">Secondary indices and auto indexing</h1> <h4 class="date">2022-10-09</h4> <p>This vignette assumes that the reader is familiar with data.table’s <code>[i, j, by]</code> syntax, and how to perform fast key based subsets. If you’re not familiar with these concepts, please read the <em>“Introduction to data.table”</em>, <em>“Reference semantics”</em> and <em>“Keys and fast binary search based subset”</em> vignettes first.</p> <hr /> <div id="data" class="section level2"> <h2>Data</h2> <p>We will use the same <code>flights</code> data as in the <em>“Introduction to data.table”</em> vignette.</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true"></a>flights <-<span class="st"> </span><span class="kw">fread</span>(<span class="st">"flights14.csv"</span>)</span> <span id="cb1-2"><a href="#cb1-2" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb1-3"><a href="#cb1-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb1-4"><a href="#cb1-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb1-5"><a href="#cb1-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb1-6"><a href="#cb1-6" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb1-7"><a href="#cb1-7" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span></span> <span id="cb1-8"><a href="#cb1-8" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb1-9"><a href="#cb1-9" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span></span> <span id="cb1-10"><a href="#cb1-10" aria-hidden="true"></a><span class="kw">dim</span>(flights)</span> <span id="cb1-11"><a href="#cb1-11" aria-hidden="true"></a><span class="co"># [1] 253316 11</span></span></code></pre></div> </div> <div id="introduction" class="section level2"> <h2>Introduction</h2> <p>In this vignette, we will</p> <ul> <li><p>discuss <em>secondary indices</em> and provide rationale as to why we need them by citing cases where setting keys is not necessarily ideal,</p></li> <li><p>perform fast subsetting, once again, but using the new <code>on</code> argument, which computes secondary indices internally for the task (temporarily), and reuses if one already exists,</p></li> <li><p>and finally look at <em>auto indexing</em> which goes a step further and creates secondary indices automatically, but does so on native R syntax for subsetting.</p></li> </ul> </div> <div id="secondary-indices" class="section level2"> <h2>1. Secondary indices</h2> <div id="a-what-are-secondary-indices" class="section level3"> <h3>a) What are secondary indices?</h3> <p>Secondary indices are similar to <code>keys</code> in <em>data.table</em>, except for two major differences:</p> <ul> <li><p>It <em>doesn’t</em> physically reorder the entire data.table in RAM. Instead, it only computes the order for the set of columns provided and stores that <em>order vector</em> in an additional attribute called <code>index</code>.</p></li> <li><p>There can be more than one secondary index for a data.table (as we will see below).</p></li> </ul> </div> <div id="b-set-and-get-secondary-indices" class="section level3"> <h3>b) Set and get secondary indices</h3> <div id="how-can-we-set-the-column-origin-as-a-secondary-index-in-the-data.table-flights" class="section level4"> <h4>– How can we set the column <code>origin</code> as a secondary index in the <em>data.table</em> <code>flights</code>?</h4> <div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true"></a><span class="kw">setindex</span>(flights, origin)</span> <span id="cb2-2"><a href="#cb2-2" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb2-3"><a href="#cb2-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb2-4"><a href="#cb2-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb2-5"><a href="#cb2-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb2-6"><a href="#cb2-6" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb2-7"><a href="#cb2-7" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span></span> <span id="cb2-8"><a href="#cb2-8" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb2-9"><a href="#cb2-9" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span></span> <span id="cb2-10"><a href="#cb2-10" aria-hidden="true"></a></span> <span id="cb2-11"><a href="#cb2-11" aria-hidden="true"></a><span class="co">## alternatively we can provide character vectors to the function 'setindexv()'</span></span> <span id="cb2-12"><a href="#cb2-12" aria-hidden="true"></a><span class="co"># setindexv(flights, "origin") # useful to program with</span></span> <span id="cb2-13"><a href="#cb2-13" aria-hidden="true"></a></span> <span id="cb2-14"><a href="#cb2-14" aria-hidden="true"></a><span class="co"># 'index' attribute added</span></span> <span id="cb2-15"><a href="#cb2-15" aria-hidden="true"></a><span class="kw">names</span>(<span class="kw">attributes</span>(flights))</span> <span id="cb2-16"><a href="#cb2-16" aria-hidden="true"></a><span class="co"># [1] "names" "row.names" "class" ".internal.selfref"</span></span> <span id="cb2-17"><a href="#cb2-17" aria-hidden="true"></a><span class="co"># [5] "index"</span></span></code></pre></div> <ul> <li><p><code>setindex</code> and <code>setindexv()</code> allows adding a secondary index to the data.table.</p></li> <li><p>Note that <code>flights</code> is <strong>not</strong> physically reordered in increasing order of <code>origin</code>, as would have been the case with <code>setkey()</code>.</p></li> <li><p>Also note that the attribute <code>index</code> has been added to <code>flights</code>.</p></li> <li><p><code>setindex(flights, NULL)</code> would remove all secondary indices.</p></li> </ul> </div> <div id="how-can-we-get-all-the-secondary-indices-set-so-far-in-flights" class="section level4"> <h4>– How can we get all the secondary indices set so far in <code>flights</code>?</h4> <div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true"></a><span class="kw">indices</span>(flights)</span> <span id="cb3-2"><a href="#cb3-2" aria-hidden="true"></a><span class="co"># [1] "origin"</span></span> <span id="cb3-3"><a href="#cb3-3" aria-hidden="true"></a></span> <span id="cb3-4"><a href="#cb3-4" aria-hidden="true"></a><span class="kw">setindex</span>(flights, origin, dest)</span> <span id="cb3-5"><a href="#cb3-5" aria-hidden="true"></a><span class="kw">indices</span>(flights)</span> <span id="cb3-6"><a href="#cb3-6" aria-hidden="true"></a><span class="co"># [1] "origin" "origin__dest"</span></span></code></pre></div> <ul> <li><p>The function <code>indices()</code> returns all current secondary indices in the data.table. If none exists, <code>NULL</code> is returned.</p></li> <li><p>Note that by creating another index on the columns <code>origin, dest</code>, we do not lose the first index created on the column <code>origin</code>, i.e., we can have multiple secondary indices.</p></li> </ul> </div> </div> <div id="c-why-do-we-need-secondary-indices" class="section level3"> <h3>c) Why do we need secondary indices?</h3> <div id="reordering-a-data.table-can-be-expensive-and-not-always-ideal" class="section level4"> <h4>– Reordering a data.table can be expensive and not always ideal</h4> <p>Consider the case where you would like to perform a fast key based subset on <code>origin</code> column for the value “JFK”. We’d do this as:</p> <div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true"></a><span class="co">## not run</span></span> <span id="cb4-2"><a href="#cb4-2" aria-hidden="true"></a><span class="kw">setkey</span>(flights, origin)</span> <span id="cb4-3"><a href="#cb4-3" aria-hidden="true"></a>flights[<span class="st">"JFK"</span>] <span class="co"># or flights[.("JFK")]</span></span></code></pre></div> </div> <div id="setkey-requires" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"><code>setkey()</code> requires:</h4> <ol style="list-style-type: lower-alpha"> <li><p>computing the order vector for the column(s) provided, here, <code>origin</code>, and</p></li> <li><p>reordering the entire data.table, by reference, based on the order vector computed.</p></li> </ol> </div> </div> </div> <div id="section" class="section level1"> <h1></h1> <p>Computing the order isn’t the time consuming part, since data.table uses true radix sorting on integer, character and numeric vectors. However reordering the data.table could be time consuming (depending on the number of rows and columns).</p> <p>Unless our task involves repeated subsetting on the same column, fast key based subsetting could effectively be nullified by the time to reorder, depending on our data.table dimensions.</p> <div id="there-can-be-only-one-key-at-the-most" class="section level4"> <h4>– There can be only one <code>key</code> at the most</h4> <p>Now if we would like to repeat the same operation but on <code>dest</code> column instead, for the value “LAX”, then we have to <code>setkey()</code>, <em>again</em>.</p> <div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true"></a><span class="co">## not run</span></span> <span id="cb5-2"><a href="#cb5-2" aria-hidden="true"></a><span class="kw">setkey</span>(flights, dest)</span> <span id="cb5-3"><a href="#cb5-3" aria-hidden="true"></a>flights[<span class="st">"LAX"</span>]</span></code></pre></div> <p>And this reorders <code>flights</code> by <code>dest</code>, <em>again</em>. What we would really like is to be able to perform the fast subsetting by eliminating the reordering step.</p> <p>And this is precisely what <em>secondary indices</em> allow for!</p> </div> <div id="secondary-indices-can-be-reused" class="section level4"> <h4>– Secondary indices can be reused</h4> <p>Since there can be multiple secondary indices, and creating an index is as simple as storing the order vector as an attribute, this allows us to even eliminate the time to recompute the order vector if an index already exists.</p> </div> <div id="the-new-on-argument-allows-for-cleaner-syntax-and-automatic-creation-and-reuse-of-secondary-indices" class="section level4"> <h4>– The new <code>on</code> argument allows for cleaner syntax and automatic creation and reuse of secondary indices</h4> <p>As we will see in the next section, the <code>on</code> argument provides several advantages:</p> </div> <div id="on-argument" class="section level4 bs-callout bs-callout-info"> <h4 class="bs-callout bs-callout-info"><code>on</code> argument</h4> <ul> <li><p>enables subsetting by computing secondary indices on the fly. This eliminates having to do <code>setindex()</code> every time.</p></li> <li><p>allows easy reuse of existing indices by just checking the attributes.</p></li> <li><p>allows for a cleaner syntax by having the columns on which the subset is performed as part of the syntax. This makes the code easier to follow when looking at it at a later point.</p> <p>Note that <code>on</code> argument can also be used on keyed subsets as well. In fact, we encourage to provide the <code>on</code> argument even when subsetting using keys for better readability.</p></li> </ul> </div> </div> <div id="section-1" class="section level1"> <h1></h1> <div id="fast-subsetting-using-on-argument-and-secondary-indices" class="section level2"> <h2>2. Fast subsetting using <code>on</code> argument and secondary indices</h2> <div id="a-fast-subsets-in-i" class="section level3"> <h3>a) Fast subsets in <code>i</code></h3> <div id="subset-all-rows-where-the-origin-airport-matches-jfk-using-on" class="section level4"> <h4>– Subset all rows where the origin airport matches <em>“JFK”</em> using <code>on</code></h4> <div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true"></a>flights[<span class="st">"JFK"</span>, on =<span class="st"> "origin"</span>]</span> <span id="cb6-2"><a href="#cb6-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb6-3"><a href="#cb6-3" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb6-4"><a href="#cb6-4" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb6-5"><a href="#cb6-5" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb6-6"><a href="#cb6-6" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb6-7"><a href="#cb6-7" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21</span></span> <span id="cb6-8"><a href="#cb6-8" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb6-9"><a href="#cb6-9" aria-hidden="true"></a><span class="co"># 81479: 2014 10 31 -4 -21 UA JFK SFO 337 2586 17</span></span> <span id="cb6-10"><a href="#cb6-10" aria-hidden="true"></a><span class="co"># 81480: 2014 10 31 -2 -37 UA JFK SFO 344 2586 18</span></span> <span id="cb6-11"><a href="#cb6-11" aria-hidden="true"></a><span class="co"># 81481: 2014 10 31 0 -33 UA JFK LAX 320 2475 17</span></span> <span id="cb6-12"><a href="#cb6-12" aria-hidden="true"></a><span class="co"># 81482: 2014 10 31 -6 -38 UA JFK SFO 343 2586 9</span></span> <span id="cb6-13"><a href="#cb6-13" aria-hidden="true"></a><span class="co"># 81483: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11</span></span> <span id="cb6-14"><a href="#cb6-14" aria-hidden="true"></a></span> <span id="cb6-15"><a href="#cb6-15" aria-hidden="true"></a><span class="co">## alternatively</span></span> <span id="cb6-16"><a href="#cb6-16" aria-hidden="true"></a><span class="co"># flights[.("JFK"), on = "origin"] (or)</span></span> <span id="cb6-17"><a href="#cb6-17" aria-hidden="true"></a><span class="co"># flights[list("JFK"), on = "origin"]</span></span></code></pre></div> <ul> <li><p>This statement performs a fast binary search based subset as well, by computing the index on the fly. However, note that it doesn’t save the index as an attribute automatically. This may change in the future.</p></li> <li><p>If we had already created a secondary index, using <code>setindex()</code>, then <code>on</code> would reuse it instead of (re)computing it. We can see that by using <code>verbose = TRUE</code>:</p> <div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true"></a><span class="kw">setindex</span>(flights, origin)</span> <span id="cb7-2"><a href="#cb7-2" aria-hidden="true"></a>flights[<span class="st">"JFK"</span>, on =<span class="st"> "origin"</span>, verbose =<span class="st"> </span><span class="ot">TRUE</span>][<span class="dv">1</span><span class="op">:</span><span class="dv">5</span>]</span> <span id="cb7-3"><a href="#cb7-3" aria-hidden="true"></a><span class="co"># i.V1 has same type (character) as x.origin. No coercion needed.</span></span> <span id="cb7-4"><a href="#cb7-4" aria-hidden="true"></a><span class="co"># on= matches existing index, using index</span></span> <span id="cb7-5"><a href="#cb7-5" aria-hidden="true"></a><span class="co"># Starting bmerge ...</span></span> <span id="cb7-6"><a href="#cb7-6" aria-hidden="true"></a><span class="co"># forder.c received 1 rows and 1 columns</span></span> <span id="cb7-7"><a href="#cb7-7" aria-hidden="true"></a><span class="co"># bmerge done in 0.000s elapsed (0.000s cpu) </span></span> <span id="cb7-8"><a href="#cb7-8" aria-hidden="true"></a><span class="co"># Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)</span></span> <span id="cb7-9"><a href="#cb7-9" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb7-10"><a href="#cb7-10" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb7-11"><a href="#cb7-11" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb7-12"><a href="#cb7-12" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb7-13"><a href="#cb7-13" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb7-14"><a href="#cb7-14" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21</span></span></code></pre></div></li> </ul> </div> <div id="how-can-i-subset-based-on-origin-and-dest-columns" class="section level4"> <h4>– How can I subset based on <code>origin</code> <em>and</em> <code>dest</code> columns?</h4> <p>For example, if we want to subset <code>"JFK", "LAX"</code> combination, then:</p> <div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true"></a>flights[.(<span class="st">"JFK"</span>, <span class="st">"LAX"</span>), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>)][<span class="dv">1</span><span class="op">:</span><span class="dv">5</span>]</span> <span id="cb8-2"><a href="#cb8-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb8-3"><a href="#cb8-3" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb8-4"><a href="#cb8-4" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb8-5"><a href="#cb8-5" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb8-6"><a href="#cb8-6" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb8-7"><a href="#cb8-7" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21</span></span></code></pre></div> <ul> <li><p><code>on</code> argument accepts a character vector of column names corresponding to the order provided to <code>i-argument</code>.</p></li> <li><p>Since the time to compute the secondary index is quite small, we don’t have to use <code>setindex()</code>, unless, once again, the task involves repeated subsetting on the same column.</p></li> </ul> </div> </div> <div id="b-select-in-j" class="section level3"> <h3>b) Select in <code>j</code></h3> <p>All the operations we will discuss below are no different to the ones we already saw in the <em>Keys and fast binary search based subset</em> vignette. Except we’ll be using the <code>on</code> argument instead of setting keys.</p> <div id="return-arr_delay-column-alone-as-a-data.table-corresponding-to-origin-lga-and-dest-tpa" class="section level4"> <h4>– Return <code>arr_delay</code> column alone as a data.table corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code></h4> <div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true"></a>flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), .(arr_delay), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>)]</span> <span id="cb9-2"><a href="#cb9-2" aria-hidden="true"></a><span class="co"># arr_delay</span></span> <span id="cb9-3"><a href="#cb9-3" aria-hidden="true"></a><span class="co"># 1: 1</span></span> <span id="cb9-4"><a href="#cb9-4" aria-hidden="true"></a><span class="co"># 2: 14</span></span> <span id="cb9-5"><a href="#cb9-5" aria-hidden="true"></a><span class="co"># 3: -17</span></span> <span id="cb9-6"><a href="#cb9-6" aria-hidden="true"></a><span class="co"># 4: -4</span></span> <span id="cb9-7"><a href="#cb9-7" aria-hidden="true"></a><span class="co"># 5: -12</span></span> <span id="cb9-8"><a href="#cb9-8" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb9-9"><a href="#cb9-9" aria-hidden="true"></a><span class="co"># 1848: 39</span></span> <span id="cb9-10"><a href="#cb9-10" aria-hidden="true"></a><span class="co"># 1849: -24</span></span> <span id="cb9-11"><a href="#cb9-11" aria-hidden="true"></a><span class="co"># 1850: -12</span></span> <span id="cb9-12"><a href="#cb9-12" aria-hidden="true"></a><span class="co"># 1851: 21</span></span> <span id="cb9-13"><a href="#cb9-13" aria-hidden="true"></a><span class="co"># 1852: -11</span></span></code></pre></div> </div> </div> <div id="c-chaining" class="section level3"> <h3>c) Chaining</h3> <div id="on-the-result-obtained-above-use-chaining-to-order-the-column-in-decreasing-order." class="section level4"> <h4>– On the result obtained above, use chaining to order the column in decreasing order.</h4> <div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true"></a>flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), .(arr_delay), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>)][<span class="kw">order</span>(<span class="op">-</span>arr_delay)]</span> <span id="cb10-2"><a href="#cb10-2" aria-hidden="true"></a><span class="co"># arr_delay</span></span> <span id="cb10-3"><a href="#cb10-3" aria-hidden="true"></a><span class="co"># 1: 486</span></span> <span id="cb10-4"><a href="#cb10-4" aria-hidden="true"></a><span class="co"># 2: 380</span></span> <span id="cb10-5"><a href="#cb10-5" aria-hidden="true"></a><span class="co"># 3: 351</span></span> <span id="cb10-6"><a href="#cb10-6" aria-hidden="true"></a><span class="co"># 4: 318</span></span> <span id="cb10-7"><a href="#cb10-7" aria-hidden="true"></a><span class="co"># 5: 300</span></span> <span id="cb10-8"><a href="#cb10-8" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb10-9"><a href="#cb10-9" aria-hidden="true"></a><span class="co"># 1848: -40</span></span> <span id="cb10-10"><a href="#cb10-10" aria-hidden="true"></a><span class="co"># 1849: -43</span></span> <span id="cb10-11"><a href="#cb10-11" aria-hidden="true"></a><span class="co"># 1850: -46</span></span> <span id="cb10-12"><a href="#cb10-12" aria-hidden="true"></a><span class="co"># 1851: -48</span></span> <span id="cb10-13"><a href="#cb10-13" aria-hidden="true"></a><span class="co"># 1852: -49</span></span></code></pre></div> </div> </div> <div id="d-compute-or-do-in-j" class="section level3"> <h3>d) Compute or <em>do</em> in <code>j</code></h3> <div id="find-the-maximum-arrival-delay-corresponding-to-origin-lga-and-dest-tpa." class="section level4"> <h4>– Find the maximum arrival delay corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code>.</h4> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true"></a>flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), <span class="kw">max</span>(arr_delay), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>)]</span> <span id="cb11-2"><a href="#cb11-2" aria-hidden="true"></a><span class="co"># [1] 486</span></span></code></pre></div> </div> </div> <div id="e-sub-assign-by-reference-using-in-j" class="section level3"> <h3>e) <em>sub-assign</em> by reference using <code>:=</code> in <code>j</code></h3> <p>We have seen this example already in the <em>Reference semantics</em> and <em>Keys and fast binary search based subset</em> vignette. Let’s take a look at all the <code>hours</code> available in the <code>flights</code> <em>data.table</em>:</p> <div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true"></a><span class="co"># get all 'hours' in flights</span></span> <span id="cb12-2"><a href="#cb12-2" aria-hidden="true"></a>flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]</span> <span id="cb12-3"><a href="#cb12-3" aria-hidden="true"></a><span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24</span></span></code></pre></div> <p>We see that there are totally <code>25</code> unique values in the data. Both <em>0</em> and <em>24</em> hours seem to be present. Let’s go ahead and replace <em>24</em> with <em>0</em>, but this time using <code>on</code> instead of setting keys.</p> <div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true"></a>flights[.(24L), hour <span class="op">:</span><span class="er">=</span><span class="st"> </span>0L, on =<span class="st"> "hour"</span>]</span></code></pre></div> <p>Now, let’s check if <code>24</code> is replaced with <code>0</code> in the <code>hour</code> column.</p> <div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true"></a>flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]</span> <span id="cb14-2"><a href="#cb14-2" aria-hidden="true"></a><span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23</span></span></code></pre></div> <ul> <li>This is particularly a huge advantage of secondary indices. Previously, just to update a few rows of <code>hour</code>, we had to <code>setkey()</code> on it, which inevitably reorders the entire data.table. With <code>on</code>, the order is preserved, and the operation is much faster! Looking at the code, the task we wanted to perform is also quite clear.</li> </ul> </div> <div id="f-aggregation-using-by" class="section level3"> <h3>f) Aggregation using <code>by</code></h3> <div id="get-the-maximum-departure-delay-for-each-month-corresponding-to-origin-jfk.-order-the-result-by-month" class="section level4"> <h4>– Get the maximum departure delay for each <code>month</code> corresponding to <code>origin = "JFK"</code>. Order the result by <code>month</code></h4> <div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[<span class="st">"JFK"</span>, <span class="kw">max</span>(dep_delay), keyby =<span class="st"> </span>month, on =<span class="st"> "origin"</span>]</span> <span id="cb15-2"><a href="#cb15-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb15-3"><a href="#cb15-3" aria-hidden="true"></a><span class="co"># month V1</span></span> <span id="cb15-4"><a href="#cb15-4" aria-hidden="true"></a><span class="co"># 1: 1 881</span></span> <span id="cb15-5"><a href="#cb15-5" aria-hidden="true"></a><span class="co"># 2: 2 1014</span></span> <span id="cb15-6"><a href="#cb15-6" aria-hidden="true"></a><span class="co"># 3: 3 920</span></span> <span id="cb15-7"><a href="#cb15-7" aria-hidden="true"></a><span class="co"># 4: 4 1241</span></span> <span id="cb15-8"><a href="#cb15-8" aria-hidden="true"></a><span class="co"># 5: 5 853</span></span> <span id="cb15-9"><a href="#cb15-9" aria-hidden="true"></a><span class="co"># 6: 6 798</span></span></code></pre></div> <ul> <li>We would have had to set the <code>key</code> back to <code>origin, dest</code> again, if we did not use <code>on</code> which internally builds secondary indices on the fly.</li> </ul> </div> </div> <div id="g-the-mult-argument" class="section level3"> <h3>g) The <em>mult</em> argument</h3> <p>The other arguments including <code>mult</code> work exactly the same way as we saw in the <em>Keys and fast binary search based subset</em> vignette. The default value for <code>mult</code> is “all”. We can choose, instead only the “first” or “last” matching rows should be returned.</p> <div id="subset-only-the-first-matching-row-where-dest-matches-bos-and-day" class="section level4"> <h4>– Subset only the first matching row where <code>dest</code> matches <em>“BOS”</em> and <em>“DAY”</em></h4> <div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true"></a>flights[<span class="kw">c</span>(<span class="st">"BOS"</span>, <span class="st">"DAY"</span>), on =<span class="st"> "dest"</span>, mult =<span class="st"> "first"</span>]</span> <span id="cb16-2"><a href="#cb16-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb16-3"><a href="#cb16-3" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 3 1 AA JFK BOS 39 187 12</span></span> <span id="cb16-4"><a href="#cb16-4" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 25 35 EV EWR DAY 102 533 17</span></span></code></pre></div> </div> <div id="subset-only-the-last-matching-row-where-origin-matches-lga-jfk-ewr-and-dest-matches-xna" class="section level4"> <h4>– Subset only the last matching row where <code>origin</code> matches <em>“LGA”, “JFK”, “EWR”</em> and <code>dest</code> matches <em>“XNA”</em></h4> <div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true"></a>flights[.(<span class="kw">c</span>(<span class="st">"LGA"</span>, <span class="st">"JFK"</span>, <span class="st">"EWR"</span>), <span class="st">"XNA"</span>), on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>), mult =<span class="st"> "last"</span>]</span> <span id="cb17-2"><a href="#cb17-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb17-3"><a href="#cb17-3" aria-hidden="true"></a><span class="co"># 1: 2014 10 31 -5 -11 MQ LGA XNA 165 1147 6</span></span> <span id="cb17-4"><a href="#cb17-4" aria-hidden="true"></a><span class="co"># 2: NA NA NA NA NA <NA> JFK XNA NA NA NA</span></span> <span id="cb17-5"><a href="#cb17-5" aria-hidden="true"></a><span class="co"># 3: 2014 10 31 -2 -25 EV EWR XNA 160 1131 6</span></span></code></pre></div> </div> </div> <div id="h-the-nomatch-argument" class="section level3"> <h3>h) The <em>nomatch</em> argument</h3> <p>We can choose if queries that do not match should return <code>NA</code> or be skipped altogether using the <code>nomatch</code> argument.</p> <div id="from-the-previous-example-subset-all-rows-only-if-theres-a-match" class="section level4"> <h4>– From the previous example, subset all rows only if there’s a match</h4> <div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true"></a>flights[.(<span class="kw">c</span>(<span class="st">"LGA"</span>, <span class="st">"JFK"</span>, <span class="st">"EWR"</span>), <span class="st">"XNA"</span>), mult =<span class="st"> "last"</span>, on =<span class="st"> </span><span class="kw">c</span>(<span class="st">"origin"</span>, <span class="st">"dest"</span>), nomatch =<span class="st"> </span><span class="ot">NULL</span>]</span> <span id="cb18-2"><a href="#cb18-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb18-3"><a href="#cb18-3" aria-hidden="true"></a><span class="co"># 1: 2014 10 31 -5 -11 MQ LGA XNA 165 1147 6</span></span> <span id="cb18-4"><a href="#cb18-4" aria-hidden="true"></a><span class="co"># 2: 2014 10 31 -2 -25 EV EWR XNA 160 1131 6</span></span></code></pre></div> <ul> <li>There are no flights connecting “JFK” and “XNA”. Therefore, that row is skipped in the result.</li> </ul> </div> </div> </div> <div id="auto-indexing" class="section level2"> <h2>3. Auto indexing</h2> <p>First we looked at how to fast subset using binary search using <em>keys</em>. Then we figured out that we could improve performance even further and have more cleaner syntax by using secondary indices.</p> <p>That is what <em>auto indexing</em> does. At the moment, it is only implemented for binary operators <code>==</code> and <code>%in%</code>. An index is automatically created <em>and</em> saved as an attribute. That is, unlike the <code>on</code> argument which computes the index on the fly each time (unless one already exists), a secondary index is created here.</p> <p>Let’s start by creating a data.table big enough to highlight the advantage.</p> <div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true"></a><span class="kw">set.seed</span>(1L)</span> <span id="cb19-2"><a href="#cb19-2" aria-hidden="true"></a>dt =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">x =</span> <span class="kw">sample</span>(<span class="fl">1e5</span>L, <span class="fl">1e7</span>L, <span class="ot">TRUE</span>), <span class="dt">y =</span> <span class="kw">runif</span>(100L))</span> <span id="cb19-3"><a href="#cb19-3" aria-hidden="true"></a><span class="kw">print</span>(<span class="kw">object.size</span>(dt), <span class="dt">units =</span> <span class="st">"Mb"</span>)</span> <span id="cb19-4"><a href="#cb19-4" aria-hidden="true"></a><span class="co"># 114.4 Mb</span></span></code></pre></div> <p>When we use <code>==</code> or <code>%in%</code> on a single column for the first time, a secondary index is created automatically, and it is used to perform the subset.</p> <div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true"></a><span class="co">## have a look at all the attribute names</span></span> <span id="cb20-2"><a href="#cb20-2" aria-hidden="true"></a><span class="kw">names</span>(<span class="kw">attributes</span>(dt))</span> <span id="cb20-3"><a href="#cb20-3" aria-hidden="true"></a><span class="co"># [1] "names" "row.names" "class" ".internal.selfref"</span></span> <span id="cb20-4"><a href="#cb20-4" aria-hidden="true"></a></span> <span id="cb20-5"><a href="#cb20-5" aria-hidden="true"></a><span class="co">## run thefirst time</span></span> <span id="cb20-6"><a href="#cb20-6" aria-hidden="true"></a>(t1 <-<span class="st"> </span><span class="kw">system.time</span>(ans <-<span class="st"> </span>dt[x <span class="op">==</span><span class="st"> </span>989L]))</span> <span id="cb20-7"><a href="#cb20-7" aria-hidden="true"></a><span class="co"># user system elapsed </span></span> <span id="cb20-8"><a href="#cb20-8" aria-hidden="true"></a><span class="co"># 0.694 0.028 0.126</span></span> <span id="cb20-9"><a href="#cb20-9" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb20-10"><a href="#cb20-10" aria-hidden="true"></a><span class="co"># x y</span></span> <span id="cb20-11"><a href="#cb20-11" aria-hidden="true"></a><span class="co"># 1: 989 0.7757157</span></span> <span id="cb20-12"><a href="#cb20-12" aria-hidden="true"></a><span class="co"># 2: 989 0.6813302</span></span> <span id="cb20-13"><a href="#cb20-13" aria-hidden="true"></a><span class="co"># 3: 989 0.2815894</span></span> <span id="cb20-14"><a href="#cb20-14" aria-hidden="true"></a><span class="co"># 4: 989 0.4954259</span></span> <span id="cb20-15"><a href="#cb20-15" aria-hidden="true"></a><span class="co"># 5: 989 0.7885886</span></span> <span id="cb20-16"><a href="#cb20-16" aria-hidden="true"></a><span class="co"># 6: 989 0.5547504</span></span> <span id="cb20-17"><a href="#cb20-17" aria-hidden="true"></a></span> <span id="cb20-18"><a href="#cb20-18" aria-hidden="true"></a><span class="co">## secondary index is created</span></span> <span id="cb20-19"><a href="#cb20-19" aria-hidden="true"></a><span class="kw">names</span>(<span class="kw">attributes</span>(dt))</span> <span id="cb20-20"><a href="#cb20-20" aria-hidden="true"></a><span class="co"># [1] "names" "row.names" "class" ".internal.selfref"</span></span> <span id="cb20-21"><a href="#cb20-21" aria-hidden="true"></a><span class="co"># [5] "index"</span></span> <span id="cb20-22"><a href="#cb20-22" aria-hidden="true"></a></span> <span id="cb20-23"><a href="#cb20-23" aria-hidden="true"></a><span class="kw">indices</span>(dt)</span> <span id="cb20-24"><a href="#cb20-24" aria-hidden="true"></a><span class="co"># [1] "x"</span></span></code></pre></div> <p>The time to subset the first time is the time to create the index + the time to subset. Since creating a secondary index involves only creating the order vector, this combined operation is faster than vector scans in many cases. But the real advantage comes in successive subsets. They are extremely fast.</p> <div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true"></a><span class="co">## successive subsets</span></span> <span id="cb21-2"><a href="#cb21-2" aria-hidden="true"></a>(t2 <-<span class="st"> </span><span class="kw">system.time</span>(dt[x <span class="op">==</span><span class="st"> </span>989L]))</span> <span id="cb21-3"><a href="#cb21-3" aria-hidden="true"></a><span class="co"># user system elapsed </span></span> <span id="cb21-4"><a href="#cb21-4" aria-hidden="true"></a><span class="co"># 0.001 0.000 0.001</span></span> <span id="cb21-5"><a href="#cb21-5" aria-hidden="true"></a><span class="kw">system.time</span>(dt[x <span class="op">%in%</span><span class="st"> </span><span class="dv">1989</span><span class="op">:</span><span class="dv">2012</span>])</span> <span id="cb21-6"><a href="#cb21-6" aria-hidden="true"></a><span class="co"># user system elapsed </span></span> <span id="cb21-7"><a href="#cb21-7" aria-hidden="true"></a><span class="co"># 0.001 0.000 0.001</span></span></code></pre></div> <ul> <li><p>Running the first time took 0.126 seconds where as the second time took 0.001 seconds.</p></li> <li><p>Auto indexing can be disabled by setting the global argument <code>options(datatable.auto.index = FALSE)</code>.</p></li> <li><p>Disabling auto indexing still allows to use indices created explicitly with <code>setindex</code> or <code>setindexv</code>. You can disable indices fully by setting global argument <code>options(datatable.use.index = FALSE)</code>.</p></li> </ul> </div> </div> <div id="section-2" class="section level1"> <h1></h1> <p>In recent version we extended auto indexing to expressions involving more than one column (combined with <code>&</code> operator). In the future, we plan to extend binary search to work with more binary operators like <code><</code>, <code><=</code>, <code>></code> and <code>>=</code>.</p> <p>We will discuss fast <em>subsets</em> using keys and secondary indices to <em>joins</em> in the next vignette, <em>“Joins and rolling joins”</em>.</p> <hr /> </div> <!-- code folding --> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>