EVOLUTION-MANAGER
Edit File: datatable-keys-fast-subset.html
<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <meta name="date" content="2022-10-09" /> <title>Keys and fast binary search based subset</title> <script>// Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); </script> <script>// Hide empty <a> tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); </script> <style type="text/css"> code{white-space: pre-wrap;} span.smallcaps{font-variant: small-caps;} span.underline{text-decoration: underline;} div.column{display: inline-block; vertical-align: top; width: 50%;} div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} ul.task-list{list-style: none;} </style> <style type="text/css"> code { white-space: pre; } .sourceCode { overflow: visible; } </style> <style type="text/css" data-origin="pandoc"> pre > code.sourceCode { white-space: pre; position: relative; } pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } code.sourceCode > span { color: inherit; text-decoration: inherit; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { pre > code.sourceCode { white-space: pre-wrap; } pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } var j = 0; while (j < rules.length) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") { j++; continue; } var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' && rule.style.backgroundColor === '') { j++; continue; } // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> <style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">Keys and fast binary search based subset</h1> <h4 class="date">2022-10-09</h4> <p>This vignette is aimed at those who are already familiar with <em>data.table</em> syntax, its general form, how to subset rows in <code>i</code>, select and compute on columns, add/modify/delete columns <em>by reference</em> in <code>j</code> and group by using <code>by</code>. If you’re not familiar with these concepts, please read the <em>“Introduction to data.table”</em> and <em>“Reference semantics”</em> vignettes first.</p> <hr /> <div id="data" class="section level2"> <h2>Data</h2> <p>We will use the same <code>flights</code> data as in the <em>“Introduction to data.table”</em> vignette.</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true"></a>flights <-<span class="st"> </span><span class="kw">fread</span>(<span class="st">"flights14.csv"</span>)</span> <span id="cb1-2"><a href="#cb1-2" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb1-3"><a href="#cb1-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb1-4"><a href="#cb1-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb1-5"><a href="#cb1-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb1-6"><a href="#cb1-6" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb1-7"><a href="#cb1-7" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7</span></span> <span id="cb1-8"><a href="#cb1-8" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb1-9"><a href="#cb1-9" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span></span> <span id="cb1-10"><a href="#cb1-10" aria-hidden="true"></a><span class="kw">dim</span>(flights)</span> <span id="cb1-11"><a href="#cb1-11" aria-hidden="true"></a><span class="co"># [1] 253316 11</span></span></code></pre></div> </div> <div id="introduction" class="section level2"> <h2>Introduction</h2> <p>In this vignette, we will</p> <ul> <li><p>first introduce the concept of <code>key</code> in <em>data.table</em>, and set and use keys to perform <em>fast binary search</em> based subsets in <code>i</code>,</p></li> <li><p>see that we can combine key based subsets along with <code>j</code> and <code>by</code> in the exact same way as before,</p></li> <li><p>look at other additional useful arguments - <code>mult</code> and <code>nomatch</code>,</p></li> <li><p>and finally conclude by looking at the advantage of setting keys - perform <em>fast binary search based subsets</em> and compare with the traditional vector scan approach.</p></li> </ul> </div> <div id="keys" class="section level2"> <h2>1. Keys</h2> <div id="a-what-is-a-key" class="section level3"> <h3>a) What is a <em>key</em>?</h3> <p>In the <em>“Introduction to data.table”</em> vignette, we saw how to subset rows in <code>i</code> using logical expressions, row numbers and using <code>order()</code>. In this section, we will look at another way of subsetting incredibly fast - using <em>keys</em>.</p> <p>But first, let’s start by looking at <em>data.frames</em>. All <em>data.frames</em> have a row names attribute. Consider the <em>data.frame</em> <code>DF</code> below.</p> <div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true"></a><span class="kw">set.seed</span>(1L)</span> <span id="cb2-2"><a href="#cb2-2" aria-hidden="true"></a>DF =<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">ID1 =</span> <span class="kw">sample</span>(letters[<span class="dv">1</span><span class="op">:</span><span class="dv">2</span>], <span class="dv">10</span>, <span class="ot">TRUE</span>),</span> <span id="cb2-3"><a href="#cb2-3" aria-hidden="true"></a> <span class="dt">ID2 =</span> <span class="kw">sample</span>(<span class="dv">1</span><span class="op">:</span><span class="dv">3</span>, <span class="dv">10</span>, <span class="ot">TRUE</span>),</span> <span id="cb2-4"><a href="#cb2-4" aria-hidden="true"></a> <span class="dt">val =</span> <span class="kw">sample</span>(<span class="dv">10</span>),</span> <span id="cb2-5"><a href="#cb2-5" aria-hidden="true"></a> <span class="dt">stringsAsFactors =</span> <span class="ot">FALSE</span>,</span> <span id="cb2-6"><a href="#cb2-6" aria-hidden="true"></a> <span class="dt">row.names =</span> <span class="kw">sample</span>(LETTERS[<span class="dv">1</span><span class="op">:</span><span class="dv">10</span>]))</span> <span id="cb2-7"><a href="#cb2-7" aria-hidden="true"></a>DF</span> <span id="cb2-8"><a href="#cb2-8" aria-hidden="true"></a><span class="co"># ID1 ID2 val</span></span> <span id="cb2-9"><a href="#cb2-9" aria-hidden="true"></a><span class="co"># I a 1 10</span></span> <span id="cb2-10"><a href="#cb2-10" aria-hidden="true"></a><span class="co"># D a 3 9</span></span> <span id="cb2-11"><a href="#cb2-11" aria-hidden="true"></a><span class="co"># G a 1 4</span></span> <span id="cb2-12"><a href="#cb2-12" aria-hidden="true"></a><span class="co"># A a 1 7</span></span> <span id="cb2-13"><a href="#cb2-13" aria-hidden="true"></a><span class="co"># B a 1 1</span></span> <span id="cb2-14"><a href="#cb2-14" aria-hidden="true"></a><span class="co"># E b 1 8</span></span> <span id="cb2-15"><a href="#cb2-15" aria-hidden="true"></a><span class="co"># C b 2 3</span></span> <span id="cb2-16"><a href="#cb2-16" aria-hidden="true"></a><span class="co"># J b 1 2</span></span> <span id="cb2-17"><a href="#cb2-17" aria-hidden="true"></a><span class="co"># F b 1 5</span></span> <span id="cb2-18"><a href="#cb2-18" aria-hidden="true"></a><span class="co"># H a 2 6</span></span> <span id="cb2-19"><a href="#cb2-19" aria-hidden="true"></a></span> <span id="cb2-20"><a href="#cb2-20" aria-hidden="true"></a><span class="kw">rownames</span>(DF)</span> <span id="cb2-21"><a href="#cb2-21" aria-hidden="true"></a><span class="co"># [1] "I" "D" "G" "A" "B" "E" "C" "J" "F" "H"</span></span></code></pre></div> <p>We can <em>subset</em> a particular row using its row name as shown below:</p> <div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true"></a>DF[<span class="st">"C"</span>, ]</span> <span id="cb3-2"><a href="#cb3-2" aria-hidden="true"></a><span class="co"># ID1 ID2 val</span></span> <span id="cb3-3"><a href="#cb3-3" aria-hidden="true"></a><span class="co"># C b 2 3</span></span></code></pre></div> <p>i.e., row names are more or less <em>an index</em> to rows of a <em>data.frame</em>. However,</p> <ol style="list-style-type: decimal"> <li><p>Each row is limited to <em>exactly one</em> row name.</p> <p>But, a person (for example) has at least two names - a <em>first</em> and a <em>second</em> name. It is useful to organise a telephone directory by <em>surname</em> then <em>first name</em>.</p></li> <li><p>And row names should be <em>unique</em>.</p> <div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true"></a><span class="kw">rownames</span>(DF) =<span class="st"> </span><span class="kw">sample</span>(LETTERS[<span class="dv">1</span><span class="op">:</span><span class="dv">5</span>], <span class="dv">10</span>, <span class="ot">TRUE</span>)</span> <span id="cb4-2"><a href="#cb4-2" aria-hidden="true"></a><span class="co"># Warning: non-unique values when setting 'row.names': 'C', 'D'</span></span> <span id="cb4-3"><a href="#cb4-3" aria-hidden="true"></a><span class="co"># Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed</span></span></code></pre></div></li> </ol> <p>Now let’s convert it to a <em>data.table</em>.</p> <div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true"></a>DT =<span class="st"> </span><span class="kw">as.data.table</span>(DF)</span> <span id="cb5-2"><a href="#cb5-2" aria-hidden="true"></a>DT</span> <span id="cb5-3"><a href="#cb5-3" aria-hidden="true"></a><span class="co"># ID1 ID2 val</span></span> <span id="cb5-4"><a href="#cb5-4" aria-hidden="true"></a><span class="co"># 1: a 1 10</span></span> <span id="cb5-5"><a href="#cb5-5" aria-hidden="true"></a><span class="co"># 2: a 3 9</span></span> <span id="cb5-6"><a href="#cb5-6" aria-hidden="true"></a><span class="co"># 3: a 1 4</span></span> <span id="cb5-7"><a href="#cb5-7" aria-hidden="true"></a><span class="co"># 4: a 1 7</span></span> <span id="cb5-8"><a href="#cb5-8" aria-hidden="true"></a><span class="co"># 5: a 1 1</span></span> <span id="cb5-9"><a href="#cb5-9" aria-hidden="true"></a><span class="co"># 6: b 1 8</span></span> <span id="cb5-10"><a href="#cb5-10" aria-hidden="true"></a><span class="co"># 7: b 2 3</span></span> <span id="cb5-11"><a href="#cb5-11" aria-hidden="true"></a><span class="co"># 8: b 1 2</span></span> <span id="cb5-12"><a href="#cb5-12" aria-hidden="true"></a><span class="co"># 9: b 1 5</span></span> <span id="cb5-13"><a href="#cb5-13" aria-hidden="true"></a><span class="co"># 10: a 2 6</span></span> <span id="cb5-14"><a href="#cb5-14" aria-hidden="true"></a></span> <span id="cb5-15"><a href="#cb5-15" aria-hidden="true"></a><span class="kw">rownames</span>(DT)</span> <span id="cb5-16"><a href="#cb5-16" aria-hidden="true"></a><span class="co"># [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"</span></span></code></pre></div> <ul> <li><p>Note that row names have been reset.</p></li> <li><p><em>data.tables</em> never uses row names. Since <em>data.tables</em> <strong>inherit</strong> from <em>data.frames</em>, it still has the row names attribute. But it never uses them. We’ll see in a moment as to why.</p> <p>If you would like to preserve the row names, use <code>keep.rownames = TRUE</code> in <code>as.data.table()</code> - this will create a new column called <code>rn</code> and assign row names to this column.</p></li> </ul> <p>Instead, in <em>data.tables</em> we set and use <code>keys</code>. Think of a <code>key</code> as <strong>supercharged rownames</strong>.</p> <div id="key-properties" class="section level4"> <h4>Keys and their properties</h4> <ol style="list-style-type: decimal"> <li><p>We can set keys on <em>multiple columns</em> and the column can be of <em>different types</em> – <em>integer</em>, <em>numeric</em>, <em>character</em>, <em>factor</em>, <em>integer64</em> etc. <em>list</em> and <em>complex</em> types are not supported yet.</p></li> <li><p>Uniqueness is not enforced, i.e., duplicate key values are allowed. Since rows are sorted by key, any duplicates in the key columns will appear consecutively.</p></li> <li><p>Setting a <code>key</code> does <em>two</em> things:</p> <ol style="list-style-type: lower-alpha"> <li><p>physically reorders the rows of the <em>data.table</em> by the column(s) provided <em>by reference</em>, always in <em>increasing</em> order.</p></li> <li><p>marks those columns as <em>key</em> columns by setting an attribute called <code>sorted</code> to the <em>data.table</em>.</p></li> </ol> <p>Since the rows are reordered, a <em>data.table</em> can have at most one key because it can not be sorted in more than one way.</p></li> </ol> <p>For the rest of the vignette, we will work with <code>flights</code> data set.</p> </div> </div> <div id="b-set-get-and-use-keys-on-a-data.table" class="section level3"> <h3>b) Set, get and use keys on a <em>data.table</em></h3> <div id="how-can-we-set-the-column-origin-as-key-in-the-data.table-flights" class="section level4"> <h4>– How can we set the column <code>origin</code> as key in the <em>data.table</em> <code>flights</code>?</h4> <div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true"></a><span class="kw">setkey</span>(flights, origin)</span> <span id="cb6-2"><a href="#cb6-2" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb6-3"><a href="#cb6-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb6-4"><a href="#cb6-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 4 0 AA EWR LAX 339 2454 18</span></span> <span id="cb6-5"><a href="#cb6-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16</span></span> <span id="cb6-6"><a href="#cb6-6" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 191 185 AA EWR DFW 214 1372 16</span></span> <span id="cb6-7"><a href="#cb6-7" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 -1 -2 AA EWR DFW 214 1372 14</span></span> <span id="cb6-8"><a href="#cb6-8" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6</span></span> <span id="cb6-9"><a href="#cb6-9" aria-hidden="true"></a><span class="co"># 6: 2014 1 1 4 -17 AA EWR DFW 215 1372 9</span></span> <span id="cb6-10"><a href="#cb6-10" aria-hidden="true"></a></span> <span id="cb6-11"><a href="#cb6-11" aria-hidden="true"></a><span class="co">## alternatively we can provide character vectors to the function 'setkeyv()'</span></span> <span id="cb6-12"><a href="#cb6-12" aria-hidden="true"></a><span class="co"># setkeyv(flights, "origin") # useful to program with</span></span></code></pre></div> <ul> <li><p>You can use the function <code>setkey()</code> and provide the column names (without quoting them). This is helpful during interactive use.</p></li> <li><p>Alternatively you can pass a character vector of column names to the function <code>setkeyv()</code>. This is particularly useful while designing functions to pass columns to set key on as function arguments.</p></li> <li><p>Note that we did not have to assign the result back to a variable. This is because like the <code>:=</code> function we saw in the <em>“Introduction to data.table”</em> vignette, <code>setkey()</code> and <code>setkeyv()</code> modify the input <em>data.table</em> <em>by reference</em>. They return the result invisibly.</p></li> <li><p>The <em>data.table</em> is now reordered (or sorted) by the column we provided - <code>origin</code>. Since we reorder by reference, we only require additional memory of one column of length equal to the number of rows in the <em>data.table</em>, and is therefore very memory efficient.</p></li> <li><p>You can also set keys directly when creating <em>data.tables</em> using the <code>data.table()</code> function using <code>key</code> argument. It takes a character vector of column names.</p></li> </ul> </div> <div id="set-and" class="section level4"> <h4>set* and <code>:=</code>:</h4> <p>In <em>data.table</em>, the <code>:=</code> operator and all the <code>set*</code> (e.g., <code>setkey</code>, <code>setorder</code>, <code>setnames</code> etc..) functions are the only ones which modify the input object <em>by reference</em>.</p> <p>Once you <em>key</em> a <em>data.table</em> by certain columns, you can subset by querying those key columns using the <code>.()</code> notation in <code>i</code>. Recall that <code>.()</code> is an <em>alias to</em> <code>list()</code>.</p> </div> <div id="use-the-key-column-origin-to-subset-all-rows-where-the-origin-airport-matches-jfk" class="section level4"> <h4>– Use the key column <code>origin</code> to subset all rows where the origin airport matches <em>“JFK”</em></h4> <div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true"></a>flights[.(<span class="st">"JFK"</span>)]</span> <span id="cb7-2"><a href="#cb7-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb7-3"><a href="#cb7-3" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9</span></span> <span id="cb7-4"><a href="#cb7-4" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11</span></span> <span id="cb7-5"><a href="#cb7-5" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19</span></span> <span id="cb7-6"><a href="#cb7-6" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13</span></span> <span id="cb7-7"><a href="#cb7-7" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21</span></span> <span id="cb7-8"><a href="#cb7-8" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb7-9"><a href="#cb7-9" aria-hidden="true"></a><span class="co"># 81479: 2014 10 31 -4 -21 UA JFK SFO 337 2586 17</span></span> <span id="cb7-10"><a href="#cb7-10" aria-hidden="true"></a><span class="co"># 81480: 2014 10 31 -2 -37 UA JFK SFO 344 2586 18</span></span> <span id="cb7-11"><a href="#cb7-11" aria-hidden="true"></a><span class="co"># 81481: 2014 10 31 0 -33 UA JFK LAX 320 2475 17</span></span> <span id="cb7-12"><a href="#cb7-12" aria-hidden="true"></a><span class="co"># 81482: 2014 10 31 -6 -38 UA JFK SFO 343 2586 9</span></span> <span id="cb7-13"><a href="#cb7-13" aria-hidden="true"></a><span class="co"># 81483: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11</span></span> <span id="cb7-14"><a href="#cb7-14" aria-hidden="true"></a></span> <span id="cb7-15"><a href="#cb7-15" aria-hidden="true"></a><span class="co">## alternatively</span></span> <span id="cb7-16"><a href="#cb7-16" aria-hidden="true"></a><span class="co"># flights[J("JFK")] (or) </span></span> <span id="cb7-17"><a href="#cb7-17" aria-hidden="true"></a><span class="co"># flights[list("JFK")]</span></span></code></pre></div> <ul> <li><p>The <em>key</em> column has already been set to <code>origin</code>. So it is sufficient to provide the value, here <em>“JFK”</em>, directly. The <code>.()</code> syntax helps identify that the task requires looking up the value <em>“JFK”</em> in the key column of <em>data.table</em> (here column <code>origin</code> of <code>flights</code> <em>data.table</em>).</p></li> <li><p>The <em>row indices</em> corresponding to the value <em>“JFK”</em> in <code>origin</code> is obtained first. And since there is no expression in <code>j</code>, all columns corresponding to those row indices are returned.</p></li> <li><p>On single column key of <em>character</em> type, you can drop the <code>.()</code> notation and use the values directly when subsetting, like subset using row names on <em>data.frames</em>.</p> <div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true"></a>flights[<span class="st">"JFK"</span>] <span class="co">## same as flights[.("JFK")]</span></span></code></pre></div></li> <li><p>We can subset any amount of values as required</p> <div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true"></a>flights[<span class="kw">c</span>(<span class="st">"JFK"</span>, <span class="st">"LGA"</span>)] <span class="co">## same as flights[.(c("JFK", "LGA"))]</span></span></code></pre></div> <p>This returns all columns corresponding to those rows where <code>origin</code> column matches either <em>“JFK”</em> or <em>“LGA”</em>.</p></li> </ul> </div> <div id="how-can-we-get-the-columns-a-data.table-is-keyed-by" class="section level4"> <h4>– How can we get the column(s) a <em>data.table</em> is keyed by?</h4> <p>Using the function <code>key()</code>.</p> <div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true"></a><span class="kw">key</span>(flights)</span> <span id="cb10-2"><a href="#cb10-2" aria-hidden="true"></a><span class="co"># [1] "origin"</span></span></code></pre></div> <ul> <li><p>It returns a character vector of all the key columns.</p></li> <li><p>If no key is set, it returns <code>NULL</code>.</p></li> </ul> </div> </div> <div id="c-keys-and-multiple-columns" class="section level3"> <h3>c) Keys and multiple columns</h3> <p>To refresh, <em>keys</em> are like <em>supercharged</em> row names. We can set key on multiple columns and they can be of multiple types.</p> <div id="how-can-i-set-keys-on-both-origin-and-dest-columns" class="section level4"> <h4>– How can I set keys on both <code>origin</code> <em>and</em> <code>dest</code> columns?</h4> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true"></a><span class="kw">setkey</span>(flights, origin, dest)</span> <span id="cb11-2"><a href="#cb11-2" aria-hidden="true"></a><span class="kw">head</span>(flights)</span> <span id="cb11-3"><a href="#cb11-3" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb11-4"><a href="#cb11-4" aria-hidden="true"></a><span class="co"># 1: 2014 1 2 -2 -25 EV EWR ALB 30 143 7</span></span> <span id="cb11-5"><a href="#cb11-5" aria-hidden="true"></a><span class="co"># 2: 2014 1 3 88 79 EV EWR ALB 29 143 23</span></span> <span id="cb11-6"><a href="#cb11-6" aria-hidden="true"></a><span class="co"># 3: 2014 1 4 220 211 EV EWR ALB 32 143 15</span></span> <span id="cb11-7"><a href="#cb11-7" aria-hidden="true"></a><span class="co"># 4: 2014 1 4 35 19 EV EWR ALB 32 143 7</span></span> <span id="cb11-8"><a href="#cb11-8" aria-hidden="true"></a><span class="co"># 5: 2014 1 5 47 42 EV EWR ALB 26 143 8</span></span> <span id="cb11-9"><a href="#cb11-9" aria-hidden="true"></a><span class="co"># 6: 2014 1 5 66 62 EV EWR ALB 31 143 23</span></span> <span id="cb11-10"><a href="#cb11-10" aria-hidden="true"></a></span> <span id="cb11-11"><a href="#cb11-11" aria-hidden="true"></a><span class="co">## or alternatively</span></span> <span id="cb11-12"><a href="#cb11-12" aria-hidden="true"></a><span class="co"># setkeyv(flights, c("origin", "dest")) # provide a character vector of column names</span></span> <span id="cb11-13"><a href="#cb11-13" aria-hidden="true"></a></span> <span id="cb11-14"><a href="#cb11-14" aria-hidden="true"></a><span class="kw">key</span>(flights)</span> <span id="cb11-15"><a href="#cb11-15" aria-hidden="true"></a><span class="co"># [1] "origin" "dest"</span></span></code></pre></div> <ul> <li>It sorts the <em>data.table</em> first by the column <code>origin</code> and then by <code>dest</code> <em>by reference</em>.</li> </ul> </div> <div id="subset-all-rows-using-key-columns-where-first-key-column-origin-matches-jfk-and-second-key-column-dest-matches-mia" class="section level4"> <h4>– Subset all rows using key columns where first key column <code>origin</code> matches <em>“JFK”</em> and second key column <code>dest</code> matches <em>“MIA”</em></h4> <div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true"></a>flights[.(<span class="st">"JFK"</span>, <span class="st">"MIA"</span>)]</span> <span id="cb12-2"><a href="#cb12-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb12-3"><a href="#cb12-3" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 -1 -17 AA JFK MIA 161 1089 15</span></span> <span id="cb12-4"><a href="#cb12-4" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 7 -8 AA JFK MIA 166 1089 9</span></span> <span id="cb12-5"><a href="#cb12-5" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 2 -1 AA JFK MIA 164 1089 12</span></span> <span id="cb12-6"><a href="#cb12-6" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 6 3 AA JFK MIA 157 1089 5</span></span> <span id="cb12-7"><a href="#cb12-7" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 6 -12 AA JFK MIA 154 1089 17</span></span> <span id="cb12-8"><a href="#cb12-8" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb12-9"><a href="#cb12-9" aria-hidden="true"></a><span class="co"># 2746: 2014 10 31 -1 -22 AA JFK MIA 148 1089 16</span></span> <span id="cb12-10"><a href="#cb12-10" aria-hidden="true"></a><span class="co"># 2747: 2014 10 31 -3 -20 AA JFK MIA 146 1089 8</span></span> <span id="cb12-11"><a href="#cb12-11" aria-hidden="true"></a><span class="co"># 2748: 2014 10 31 2 -17 AA JFK MIA 150 1089 6</span></span> <span id="cb12-12"><a href="#cb12-12" aria-hidden="true"></a><span class="co"># 2749: 2014 10 31 -3 -12 AA JFK MIA 150 1089 5</span></span> <span id="cb12-13"><a href="#cb12-13" aria-hidden="true"></a><span class="co"># 2750: 2014 10 31 29 4 AA JFK MIA 146 1089 19</span></span></code></pre></div> </div> <div id="multiple-key-point" class="section level4"> <h4>How does the subset work here?</h4> <ul> <li><p>It is important to understand how this works internally. <em>“JFK”</em> is first matched against the first key column <code>origin</code>. And <em>within those matching rows</em>, <em>“MIA”</em> is matched against the second key column <code>dest</code> to obtain <em>row indices</em> where both <code>origin</code> and <code>dest</code> match the given values.</p></li> <li><p>Since no <code>j</code> is provided, we simply return <em>all columns</em> corresponding to those row indices.</p></li> </ul> </div> <div id="subset-all-rows-where-just-the-first-key-column-origin-matches-jfk" class="section level4"> <h4>– Subset all rows where just the first key column <code>origin</code> matches <em>“JFK”</em></h4> <div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true"></a><span class="kw">key</span>(flights)</span> <span id="cb13-2"><a href="#cb13-2" aria-hidden="true"></a><span class="co"># [1] "origin" "dest"</span></span> <span id="cb13-3"><a href="#cb13-3" aria-hidden="true"></a></span> <span id="cb13-4"><a href="#cb13-4" aria-hidden="true"></a>flights[.(<span class="st">"JFK"</span>)] <span class="co">## or in this case simply flights["JFK"], for convenience</span></span> <span id="cb13-5"><a href="#cb13-5" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb13-6"><a href="#cb13-6" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 10 4 B6 JFK ABQ 280 1826 20</span></span> <span id="cb13-7"><a href="#cb13-7" aria-hidden="true"></a><span class="co"># 2: 2014 1 2 134 161 B6 JFK ABQ 252 1826 22</span></span> <span id="cb13-8"><a href="#cb13-8" aria-hidden="true"></a><span class="co"># 3: 2014 1 7 6 6 B6 JFK ABQ 269 1826 20</span></span> <span id="cb13-9"><a href="#cb13-9" aria-hidden="true"></a><span class="co"># 4: 2014 1 8 15 -15 B6 JFK ABQ 259 1826 20</span></span> <span id="cb13-10"><a href="#cb13-10" aria-hidden="true"></a><span class="co"># 5: 2014 1 9 45 32 B6 JFK ABQ 267 1826 20</span></span> <span id="cb13-11"><a href="#cb13-11" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb13-12"><a href="#cb13-12" aria-hidden="true"></a><span class="co"># 81479: 2014 10 31 0 -18 DL JFK TPA 142 1005 8</span></span> <span id="cb13-13"><a href="#cb13-13" aria-hidden="true"></a><span class="co"># 81480: 2014 10 31 1 -8 B6 JFK TPA 149 1005 19</span></span> <span id="cb13-14"><a href="#cb13-14" aria-hidden="true"></a><span class="co"># 81481: 2014 10 31 -2 -22 B6 JFK TPA 145 1005 14</span></span> <span id="cb13-15"><a href="#cb13-15" aria-hidden="true"></a><span class="co"># 81482: 2014 10 31 -8 -5 B6 JFK TPA 149 1005 9</span></span> <span id="cb13-16"><a href="#cb13-16" aria-hidden="true"></a><span class="co"># 81483: 2014 10 31 -4 -18 B6 JFK TPA 145 1005 8</span></span></code></pre></div> <ul> <li>Since we did not provide any values for the second key column <code>dest</code>, it just matches <em>“JFK”</em> against the first key column <code>origin</code> and returns all the matched rows.</li> </ul> </div> <div id="subset-all-rows-where-just-the-second-key-column-dest-matches-mia" class="section level4"> <h4>– Subset all rows where just the second key column <code>dest</code> matches <em>“MIA”</em></h4> <div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true"></a>flights[.(<span class="kw">unique</span>(origin), <span class="st">"MIA"</span>)]</span> <span id="cb14-2"><a href="#cb14-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb14-3"><a href="#cb14-3" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16</span></span> <span id="cb14-4"><a href="#cb14-4" aria-hidden="true"></a><span class="co"># 2: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6</span></span> <span id="cb14-5"><a href="#cb14-5" aria-hidden="true"></a><span class="co"># 3: 2014 1 1 -5 -8 AA EWR MIA 157 1085 11</span></span> <span id="cb14-6"><a href="#cb14-6" aria-hidden="true"></a><span class="co"># 4: 2014 1 1 43 42 UA EWR MIA 155 1085 15</span></span> <span id="cb14-7"><a href="#cb14-7" aria-hidden="true"></a><span class="co"># 5: 2014 1 1 60 49 UA EWR MIA 162 1085 21</span></span> <span id="cb14-8"><a href="#cb14-8" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb14-9"><a href="#cb14-9" aria-hidden="true"></a><span class="co"># 9924: 2014 10 31 -11 -8 AA LGA MIA 157 1096 13</span></span> <span id="cb14-10"><a href="#cb14-10" aria-hidden="true"></a><span class="co"># 9925: 2014 10 31 -5 -11 AA LGA MIA 150 1096 9</span></span> <span id="cb14-11"><a href="#cb14-11" aria-hidden="true"></a><span class="co"># 9926: 2014 10 31 -2 10 AA LGA MIA 156 1096 6</span></span> <span id="cb14-12"><a href="#cb14-12" aria-hidden="true"></a><span class="co"># 9927: 2014 10 31 -2 -16 AA LGA MIA 156 1096 19</span></span> <span id="cb14-13"><a href="#cb14-13" aria-hidden="true"></a><span class="co"># 9928: 2014 10 31 1 -11 US LGA MIA 164 1096 15</span></span></code></pre></div> </div> <div id="whats-happening-here" class="section level4"> <h4>What’s happening here?</h4> <ul> <li><p>Read <a href="#multiple-key-point">this</a> again. The value provided for the second key column <em>“MIA”</em> has to find the matching values in <code>dest</code> key column <em>on the matching rows provided by the first key column <code>origin</code></em>. We can not skip the values of key columns <em>before</em>. Therefore we provide <em>all</em> unique values from key column <code>origin</code>.</p></li> <li><p><em>“MIA”</em> is automatically recycled to fit the length of <code>unique(origin)</code> which is <em>3</em>.</p></li> </ul> </div> </div> </div> <div id="combining-keys-with-j-and-by" class="section level2"> <h2>2) Combining keys with <code>j</code> and <code>by</code></h2> <p>All we have seen so far is the same concept – obtaining <em>row indices</em> in <code>i</code>, but just using a different method – using <code>keys</code>. It shouldn’t be surprising that we can do exactly the same things in <code>j</code> and <code>by</code> as seen from the previous vignettes. We will highlight this with a few examples.</p> <div id="a-select-in-j" class="section level3"> <h3>a) Select in <code>j</code></h3> <div id="return-arr_delay-column-as-a-data.table-corresponding-to-origin-lga-and-dest-tpa." class="section level4"> <h4>– Return <code>arr_delay</code> column as a <em>data.table</em> corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code>.</h4> <div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true"></a><span class="kw">key</span>(flights)</span> <span id="cb15-2"><a href="#cb15-2" aria-hidden="true"></a><span class="co"># [1] "origin" "dest"</span></span> <span id="cb15-3"><a href="#cb15-3" aria-hidden="true"></a>flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), .(arr_delay)]</span> <span id="cb15-4"><a href="#cb15-4" aria-hidden="true"></a><span class="co"># arr_delay</span></span> <span id="cb15-5"><a href="#cb15-5" aria-hidden="true"></a><span class="co"># 1: 1</span></span> <span id="cb15-6"><a href="#cb15-6" aria-hidden="true"></a><span class="co"># 2: 14</span></span> <span id="cb15-7"><a href="#cb15-7" aria-hidden="true"></a><span class="co"># 3: -17</span></span> <span id="cb15-8"><a href="#cb15-8" aria-hidden="true"></a><span class="co"># 4: -4</span></span> <span id="cb15-9"><a href="#cb15-9" aria-hidden="true"></a><span class="co"># 5: -12</span></span> <span id="cb15-10"><a href="#cb15-10" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb15-11"><a href="#cb15-11" aria-hidden="true"></a><span class="co"># 1848: 39</span></span> <span id="cb15-12"><a href="#cb15-12" aria-hidden="true"></a><span class="co"># 1849: -24</span></span> <span id="cb15-13"><a href="#cb15-13" aria-hidden="true"></a><span class="co"># 1850: -12</span></span> <span id="cb15-14"><a href="#cb15-14" aria-hidden="true"></a><span class="co"># 1851: 21</span></span> <span id="cb15-15"><a href="#cb15-15" aria-hidden="true"></a><span class="co"># 1852: -11</span></span></code></pre></div> <ul> <li><p>The <em>row indices</em> corresponding to <code>origin == "LGA"</code> and <code>dest == "TPA"</code> are obtained using <em>key based subset</em>.</p></li> <li><p>Once we have the row indices, we look at <code>j</code> which requires only the <code>arr_delay</code> column. So we simply select the column <code>arr_delay</code> for those <em>row indices</em> in the exact same way as we have seen in <em>Introduction to data.table</em> vignette.</p></li> <li><p>We could have returned the result by using <code>with = FALSE</code> as well.</p> <div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true"></a>flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), <span class="st">"arr_delay"</span>, with =<span class="st"> </span><span class="ot">FALSE</span>]</span></code></pre></div></li> </ul> </div> </div> <div id="b-chaining" class="section level3"> <h3>b) Chaining</h3> <div id="on-the-result-obtained-above-use-chaining-to-order-the-column-in-decreasing-order." class="section level4"> <h4>– On the result obtained above, use chaining to order the column in decreasing order.</h4> <div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true"></a>flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), .(arr_delay)][<span class="kw">order</span>(<span class="op">-</span>arr_delay)]</span> <span id="cb17-2"><a href="#cb17-2" aria-hidden="true"></a><span class="co"># arr_delay</span></span> <span id="cb17-3"><a href="#cb17-3" aria-hidden="true"></a><span class="co"># 1: 486</span></span> <span id="cb17-4"><a href="#cb17-4" aria-hidden="true"></a><span class="co"># 2: 380</span></span> <span id="cb17-5"><a href="#cb17-5" aria-hidden="true"></a><span class="co"># 3: 351</span></span> <span id="cb17-6"><a href="#cb17-6" aria-hidden="true"></a><span class="co"># 4: 318</span></span> <span id="cb17-7"><a href="#cb17-7" aria-hidden="true"></a><span class="co"># 5: 300</span></span> <span id="cb17-8"><a href="#cb17-8" aria-hidden="true"></a><span class="co"># --- </span></span> <span id="cb17-9"><a href="#cb17-9" aria-hidden="true"></a><span class="co"># 1848: -40</span></span> <span id="cb17-10"><a href="#cb17-10" aria-hidden="true"></a><span class="co"># 1849: -43</span></span> <span id="cb17-11"><a href="#cb17-11" aria-hidden="true"></a><span class="co"># 1850: -46</span></span> <span id="cb17-12"><a href="#cb17-12" aria-hidden="true"></a><span class="co"># 1851: -48</span></span> <span id="cb17-13"><a href="#cb17-13" aria-hidden="true"></a><span class="co"># 1852: -49</span></span></code></pre></div> </div> </div> <div id="c-compute-or-do-in-j" class="section level3"> <h3>c) Compute or <em>do</em> in <code>j</code></h3> <div id="find-the-maximum-arrival-delay-corresponding-to-origin-lga-and-dest-tpa." class="section level4"> <h4>– Find the maximum arrival delay corresponding to <code>origin = "LGA"</code> and <code>dest = "TPA"</code>.</h4> <div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true"></a>flights[.(<span class="st">"LGA"</span>, <span class="st">"TPA"</span>), <span class="kw">max</span>(arr_delay)]</span> <span id="cb18-2"><a href="#cb18-2" aria-hidden="true"></a><span class="co"># [1] 486</span></span></code></pre></div> <ul> <li>We can verify that the result is identical to first value (486) from the previous example.</li> </ul> </div> </div> <div id="d-sub-assign-by-reference-using-in-j" class="section level3"> <h3>d) <em>sub-assign</em> by reference using <code>:=</code> in <code>j</code></h3> <p>We have seen this example already in the <em>Reference semantics</em> vignette. Let’s take a look at all the <code>hours</code> available in the <code>flights</code> <em>data.table</em>:</p> <div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true"></a><span class="co"># get all 'hours' in flights</span></span> <span id="cb19-2"><a href="#cb19-2" aria-hidden="true"></a>flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]</span> <span id="cb19-3"><a href="#cb19-3" aria-hidden="true"></a><span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24</span></span></code></pre></div> <p>We see that there are totally <code>25</code> unique values in the data. Both <em>0</em> and <em>24</em> hours seem to be present. Let’s go ahead and replace <em>24</em> with <em>0</em>, but this time using <em>key</em>.</p> <div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true"></a><span class="kw">setkey</span>(flights, hour)</span> <span id="cb20-2"><a href="#cb20-2" aria-hidden="true"></a><span class="kw">key</span>(flights)</span> <span id="cb20-3"><a href="#cb20-3" aria-hidden="true"></a><span class="co"># [1] "hour"</span></span> <span id="cb20-4"><a href="#cb20-4" aria-hidden="true"></a>flights[.(<span class="dv">24</span>), hour <span class="op">:</span><span class="er">=</span><span class="st"> </span>0L]</span> <span id="cb20-5"><a href="#cb20-5" aria-hidden="true"></a><span class="kw">key</span>(flights)</span> <span id="cb20-6"><a href="#cb20-6" aria-hidden="true"></a><span class="co"># NULL</span></span></code></pre></div> <ul> <li><p>We first set <code>key</code> to <code>hour</code>. This reorders <code>flights</code> by the column <code>hour</code> and marks that column as the <code>key</code> column.</p></li> <li><p>Now we can subset on <code>hour</code> by using the <code>.()</code> notation. We subset for the value <em>24</em> and obtain the corresponding <em>row indices</em>.</p></li> <li><p>And on those row indices, we replace the <code>key</code> column with the value <code>0</code>.</p></li> <li><p>Since we have replaced values on the <em>key</em> column, the <em>data.table</em> <code>flights</code> isn’t sorted by <code>hour</code> any more. Therefore, the key has been automatically removed by setting to NULL.</p></li> </ul> <p>Now, there shouldn’t be any <em>24</em> in the <code>hour</code> column.</p> <div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true"></a>flights[, <span class="kw">sort</span>(<span class="kw">unique</span>(hour))]</span> <span id="cb21-2"><a href="#cb21-2" aria-hidden="true"></a><span class="co"># [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23</span></span></code></pre></div> </div> <div id="e-aggregation-using-by" class="section level3"> <h3>e) Aggregation using <code>by</code></h3> <p>Let’s set the key back to <code>origin, dest</code> first.</p> <div class="sourceCode" id="cb22"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true"></a><span class="kw">setkey</span>(flights, origin, dest)</span> <span id="cb22-2"><a href="#cb22-2" aria-hidden="true"></a><span class="kw">key</span>(flights)</span> <span id="cb22-3"><a href="#cb22-3" aria-hidden="true"></a><span class="co"># [1] "origin" "dest"</span></span></code></pre></div> <div id="get-the-maximum-departure-delay-for-each-month-corresponding-to-origin-jfk.-order-the-result-by-month" class="section level4"> <h4>– Get the maximum departure delay for each <code>month</code> corresponding to <code>origin = "JFK"</code>. Order the result by <code>month</code></h4> <div class="sourceCode" id="cb23"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true"></a>ans <-<span class="st"> </span>flights[<span class="st">"JFK"</span>, <span class="kw">max</span>(dep_delay), keyby =<span class="st"> </span>month]</span> <span id="cb23-2"><a href="#cb23-2" aria-hidden="true"></a><span class="kw">head</span>(ans)</span> <span id="cb23-3"><a href="#cb23-3" aria-hidden="true"></a><span class="co"># month V1</span></span> <span id="cb23-4"><a href="#cb23-4" aria-hidden="true"></a><span class="co"># 1: 1 881</span></span> <span id="cb23-5"><a href="#cb23-5" aria-hidden="true"></a><span class="co"># 2: 2 1014</span></span> <span id="cb23-6"><a href="#cb23-6" aria-hidden="true"></a><span class="co"># 3: 3 920</span></span> <span id="cb23-7"><a href="#cb23-7" aria-hidden="true"></a><span class="co"># 4: 4 1241</span></span> <span id="cb23-8"><a href="#cb23-8" aria-hidden="true"></a><span class="co"># 5: 5 853</span></span> <span id="cb23-9"><a href="#cb23-9" aria-hidden="true"></a><span class="co"># 6: 6 798</span></span> <span id="cb23-10"><a href="#cb23-10" aria-hidden="true"></a><span class="kw">key</span>(ans)</span> <span id="cb23-11"><a href="#cb23-11" aria-hidden="true"></a><span class="co"># [1] "month"</span></span></code></pre></div> <ul> <li><p>We subset on the <code>key</code> column <em>origin</em> to obtain the <em>row indices</em> corresponding to <em>“JFK”</em>.</p></li> <li><p>Once we obtain the row indices, we only need two columns - <code>month</code> to group by and <code>dep_delay</code> to obtain <code>max()</code> for each group. <em>data.table’s</em> query optimisation therefore subsets just those two columns corresponding to the <em>row indices</em> obtained in <code>i</code>, for speed and memory efficiency.</p></li> <li><p>And on that subset, we group by <em>month</em> and compute <code>max(dep_delay)</code>.</p></li> <li><p>We use <code>keyby</code> to automatically key that result by <em>month</em>. Now we understand what that means. In addition to ordering, it also sets <em>month</em> as the <code>key</code> column.</p></li> </ul> </div> </div> </div> <div id="additional-arguments---mult-and-nomatch" class="section level2"> <h2>3) Additional arguments - <code>mult</code> and <code>nomatch</code></h2> <div id="a-the-mult-argument" class="section level3"> <h3>a) The <em>mult</em> argument</h3> <p>We can choose, for each query, if <em>“all”</em> the matching rows should be returned, or just the <em>“first”</em> or <em>“last”</em> using the <code>mult</code> argument. The default value is <em>“all”</em> - what we’ve seen so far.</p> <div id="subset-only-the-first-matching-row-from-all-rows-where-origin-matches-jfk-and-dest-matches-mia" class="section level4"> <h4>– Subset only the first matching row from all rows where <code>origin</code> matches <em>“JFK”</em> and <code>dest</code> matches <em>“MIA”</em></h4> <div class="sourceCode" id="cb24"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true"></a>flights[.(<span class="st">"JFK"</span>, <span class="st">"MIA"</span>), mult =<span class="st"> "first"</span>]</span> <span id="cb24-2"><a href="#cb24-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb24-3"><a href="#cb24-3" aria-hidden="true"></a><span class="co"># 1: 2014 1 1 6 3 AA JFK MIA 157 1089 5</span></span></code></pre></div> </div> <div id="subset-only-the-last-matching-row-of-all-the-rows-where-origin-matches-lga-jfk-ewr-and-dest-matches-xna" class="section level4"> <h4>– Subset only the last matching row of all the rows where <code>origin</code> matches <em>“LGA”, “JFK”, “EWR”</em> and <code>dest</code> matches <em>“XNA”</em></h4> <div class="sourceCode" id="cb25"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true"></a>flights[.(<span class="kw">c</span>(<span class="st">"LGA"</span>, <span class="st">"JFK"</span>, <span class="st">"EWR"</span>), <span class="st">"XNA"</span>), mult =<span class="st"> "last"</span>]</span> <span id="cb25-2"><a href="#cb25-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb25-3"><a href="#cb25-3" aria-hidden="true"></a><span class="co"># 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18</span></span> <span id="cb25-4"><a href="#cb25-4" aria-hidden="true"></a><span class="co"># 2: NA NA NA NA NA <NA> JFK XNA NA NA NA</span></span> <span id="cb25-5"><a href="#cb25-5" aria-hidden="true"></a><span class="co"># 3: 2014 2 3 231 268 EV EWR XNA 184 1131 12</span></span></code></pre></div> <ul> <li><p>The query <em>“JFK”, “XNA”</em> doesn’t match any rows in <code>flights</code> and therefore returns <code>NA</code>.</p></li> <li><p>Once again, the query for second key column <code>dest</code>, <em>“XNA”</em>, is recycled to fit the length of the query for first key column <code>origin</code>, which is of length 3.</p></li> </ul> </div> </div> <div id="b-the-nomatch-argument" class="section level3"> <h3>b) The <em>nomatch</em> argument</h3> <p>We can choose if queries that do not match should return <code>NA</code> or be skipped altogether using the <code>nomatch</code> argument.</p> <div id="from-the-previous-example-subset-all-rows-only-if-theres-a-match" class="section level4"> <h4>– From the previous example, Subset all rows only if there’s a match</h4> <div class="sourceCode" id="cb26"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true"></a>flights[.(<span class="kw">c</span>(<span class="st">"LGA"</span>, <span class="st">"JFK"</span>, <span class="st">"EWR"</span>), <span class="st">"XNA"</span>), mult =<span class="st"> "last"</span>, nomatch =<span class="st"> </span><span class="ot">NULL</span>]</span> <span id="cb26-2"><a href="#cb26-2" aria-hidden="true"></a><span class="co"># year month day dep_delay arr_delay carrier origin dest air_time distance hour</span></span> <span id="cb26-3"><a href="#cb26-3" aria-hidden="true"></a><span class="co"># 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18</span></span> <span id="cb26-4"><a href="#cb26-4" aria-hidden="true"></a><span class="co"># 2: 2014 2 3 231 268 EV EWR XNA 184 1131 12</span></span></code></pre></div> <ul> <li><p>Default value for <code>nomatch</code> is <code>NA</code>. Setting <code>nomatch = NULL</code> skips queries with no matches.</p></li> <li><p>The query “JFK”, “XNA” doesn’t match any rows in flights and therefore is skipped.</p></li> </ul> </div> </div> </div> <div id="binary-search-vs-vector-scans" class="section level2"> <h2>4) binary search vs vector scans</h2> <p>We have seen so far how we can set and use keys to subset. But what’s the advantage? For example, instead of doing:</p> <div class="sourceCode" id="cb27"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true"></a><span class="co"># key by origin,dest columns</span></span> <span id="cb27-2"><a href="#cb27-2" aria-hidden="true"></a>flights[.(<span class="st">"JFK"</span>, <span class="st">"MIA"</span>)]</span></code></pre></div> <p>we could have done:</p> <div class="sourceCode" id="cb28"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb28-1"><a href="#cb28-1" aria-hidden="true"></a>flights[origin <span class="op">==</span><span class="st"> "JFK"</span> <span class="op">&</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> "MIA"</span>]</span></code></pre></div> <p>One advantage very likely is shorter syntax. But even more than that, <em>binary search based subsets</em> are <strong>incredibly fast</strong>.</p> <p>As the time goes <code>data.table</code> gets new optimization and currently the latter call is automatically optimized to use <em>binary search</em>.<br /> To use slow <em>vector scan</em> key needs to be removed.</p> <div class="sourceCode" id="cb29"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true"></a><span class="kw">setkey</span>(flights, <span class="ot">NULL</span>)</span> <span id="cb29-2"><a href="#cb29-2" aria-hidden="true"></a>flights[origin <span class="op">==</span><span class="st"> "JFK"</span> <span class="op">&</span><span class="st"> </span>dest <span class="op">==</span><span class="st"> "MIA"</span>]</span></code></pre></div> <div id="a-performance-of-binary-search-approach" class="section level3"> <h3>a) Performance of binary search approach</h3> <p>To illustrate, let’s create a sample <em>data.table</em> with 20 million rows and three columns and key it by columns <code>x</code> and <code>y</code>.</p> <div class="sourceCode" id="cb30"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb30-1"><a href="#cb30-1" aria-hidden="true"></a><span class="kw">set.seed</span>(2L)</span> <span id="cb30-2"><a href="#cb30-2" aria-hidden="true"></a>N =<span class="st"> </span><span class="fl">2e7</span>L</span> <span id="cb30-3"><a href="#cb30-3" aria-hidden="true"></a>DT =<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">x =</span> <span class="kw">sample</span>(letters, N, <span class="ot">TRUE</span>),</span> <span id="cb30-4"><a href="#cb30-4" aria-hidden="true"></a> <span class="dt">y =</span> <span class="kw">sample</span>(1000L, N, <span class="ot">TRUE</span>),</span> <span id="cb30-5"><a href="#cb30-5" aria-hidden="true"></a> <span class="dt">val =</span> <span class="kw">runif</span>(N))</span> <span id="cb30-6"><a href="#cb30-6" aria-hidden="true"></a><span class="kw">print</span>(<span class="kw">object.size</span>(DT), <span class="dt">units =</span> <span class="st">"Mb"</span>)</span> <span id="cb30-7"><a href="#cb30-7" aria-hidden="true"></a><span class="co"># 381.5 Mb</span></span></code></pre></div> <p><code>DT</code> is ~380MB. It is not really huge, but this will do to illustrate the point.</p> <p>From what we have seen in the Introduction to data.table section, we can subset those rows where columns <code>x = "g"</code> and <code>y = 877</code> as follows:</p> <div class="sourceCode" id="cb31"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb31-1"><a href="#cb31-1" aria-hidden="true"></a><span class="kw">key</span>(DT)</span> <span id="cb31-2"><a href="#cb31-2" aria-hidden="true"></a><span class="co"># NULL</span></span> <span id="cb31-3"><a href="#cb31-3" aria-hidden="true"></a><span class="co">## (1) Usual way of subsetting - vector scan approach</span></span> <span id="cb31-4"><a href="#cb31-4" aria-hidden="true"></a>t1 <-<span class="st"> </span><span class="kw">system.time</span>(ans1 <-<span class="st"> </span>DT[x <span class="op">==</span><span class="st"> "g"</span> <span class="op">&</span><span class="st"> </span>y <span class="op">==</span><span class="st"> </span>877L])</span> <span id="cb31-5"><a href="#cb31-5" aria-hidden="true"></a>t1</span> <span id="cb31-6"><a href="#cb31-6" aria-hidden="true"></a><span class="co"># user system elapsed </span></span> <span id="cb31-7"><a href="#cb31-7" aria-hidden="true"></a><span class="co"># 0.755 0.092 0.166</span></span> <span id="cb31-8"><a href="#cb31-8" aria-hidden="true"></a><span class="kw">head</span>(ans1)</span> <span id="cb31-9"><a href="#cb31-9" aria-hidden="true"></a><span class="co"># x y val</span></span> <span id="cb31-10"><a href="#cb31-10" aria-hidden="true"></a><span class="co"># 1: g 877 0.57059767</span></span> <span id="cb31-11"><a href="#cb31-11" aria-hidden="true"></a><span class="co"># 2: g 877 0.74859806</span></span> <span id="cb31-12"><a href="#cb31-12" aria-hidden="true"></a><span class="co"># 3: g 877 0.03616756</span></span> <span id="cb31-13"><a href="#cb31-13" aria-hidden="true"></a><span class="co"># 4: g 877 0.28087868</span></span> <span id="cb31-14"><a href="#cb31-14" aria-hidden="true"></a><span class="co"># 5: g 877 0.83727299</span></span> <span id="cb31-15"><a href="#cb31-15" aria-hidden="true"></a><span class="co"># 6: g 877 0.43867189</span></span> <span id="cb31-16"><a href="#cb31-16" aria-hidden="true"></a><span class="kw">dim</span>(ans1)</span> <span id="cb31-17"><a href="#cb31-17" aria-hidden="true"></a><span class="co"># [1] 762 3</span></span></code></pre></div> <p>Now let’s try to subset by using keys.</p> <div class="sourceCode" id="cb32"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb32-1"><a href="#cb32-1" aria-hidden="true"></a><span class="kw">setkeyv</span>(DT, <span class="kw">c</span>(<span class="st">"x"</span>, <span class="st">"y"</span>))</span> <span id="cb32-2"><a href="#cb32-2" aria-hidden="true"></a><span class="kw">key</span>(DT)</span> <span id="cb32-3"><a href="#cb32-3" aria-hidden="true"></a><span class="co"># [1] "x" "y"</span></span> <span id="cb32-4"><a href="#cb32-4" aria-hidden="true"></a><span class="co">## (2) Subsetting using keys</span></span> <span id="cb32-5"><a href="#cb32-5" aria-hidden="true"></a>t2 <-<span class="st"> </span><span class="kw">system.time</span>(ans2 <-<span class="st"> </span>DT[.(<span class="st">"g"</span>, 877L)])</span> <span id="cb32-6"><a href="#cb32-6" aria-hidden="true"></a>t2</span> <span id="cb32-7"><a href="#cb32-7" aria-hidden="true"></a><span class="co"># user system elapsed </span></span> <span id="cb32-8"><a href="#cb32-8" aria-hidden="true"></a><span class="co"># 0.001 0.000 0.000</span></span> <span id="cb32-9"><a href="#cb32-9" aria-hidden="true"></a><span class="kw">head</span>(ans2)</span> <span id="cb32-10"><a href="#cb32-10" aria-hidden="true"></a><span class="co"># x y val</span></span> <span id="cb32-11"><a href="#cb32-11" aria-hidden="true"></a><span class="co"># 1: g 877 0.57059767</span></span> <span id="cb32-12"><a href="#cb32-12" aria-hidden="true"></a><span class="co"># 2: g 877 0.74859806</span></span> <span id="cb32-13"><a href="#cb32-13" aria-hidden="true"></a><span class="co"># 3: g 877 0.03616756</span></span> <span id="cb32-14"><a href="#cb32-14" aria-hidden="true"></a><span class="co"># 4: g 877 0.28087868</span></span> <span id="cb32-15"><a href="#cb32-15" aria-hidden="true"></a><span class="co"># 5: g 877 0.83727299</span></span> <span id="cb32-16"><a href="#cb32-16" aria-hidden="true"></a><span class="co"># 6: g 877 0.43867189</span></span> <span id="cb32-17"><a href="#cb32-17" aria-hidden="true"></a><span class="kw">dim</span>(ans2)</span> <span id="cb32-18"><a href="#cb32-18" aria-hidden="true"></a><span class="co"># [1] 762 3</span></span> <span id="cb32-19"><a href="#cb32-19" aria-hidden="true"></a></span> <span id="cb32-20"><a href="#cb32-20" aria-hidden="true"></a><span class="kw">identical</span>(ans1<span class="op">$</span>val, ans2<span class="op">$</span>val)</span> <span id="cb32-21"><a href="#cb32-21" aria-hidden="true"></a><span class="co"># [1] TRUE</span></span></code></pre></div> <ul> <li>The speed-up is <strong>~166x</strong>!</li> </ul> </div> <div id="b-why-does-keying-a-data.table-result-in-blazing-fast-subsets" class="section level3"> <h3>b) Why does keying a <em>data.table</em> result in blazing fast subsets?</h3> <p>To understand that, let’s first look at what <em>vector scan approach</em> (method 1) does.</p> <div id="vector-scan-approach" class="section level4"> <h4>Vector scan approach</h4> <ul> <li><p>The column <code>x</code> is searched for the value <em>“g”</em> row by row, on all 20 million of them. This results in a <em>logical vector</em> of size 20 million, with values <code>TRUE, FALSE or NA</code> corresponding to <code>x</code>’s value.</p></li> <li><p>Similarly, the column <code>y</code> is searched for <code>877</code> on all 20 million rows one by one, and stored in another logical vector.</p></li> <li><p>Element wise <code>&</code> operations are performed on the intermediate logical vectors and all the rows where the expression evaluates to <code>TRUE</code> are returned.</p></li> </ul> <p>This is what we call a <em>vector scan approach</em>. And this is quite inefficient, especially on larger tables and when one needs repeated subsetting, because it has to scan through all the rows each time.</p> <p>Now let us look at binary search approach (method 2). Recall from <a href="#key-properties">Properties of key</a> - <em>setting keys reorders the data.table by key columns</em>. Since the data is sorted, we don’t have to <em>scan through the entire length of the column</em>! We can instead use <em>binary search</em> to search a value in <code>O(log n)</code> as opposed to <code>O(n)</code> in case of <em>vector scan approach</em>, where <code>n</code> is the number of rows in the <em>data.table</em>.</p> </div> <div id="binary-search-approach" class="section level4"> <h4>Binary search approach</h4> <p>Here’s a very simple illustration. Let’s consider the (sorted) numbers shown below:</p> <div class="sourceCode" id="cb33"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true"></a><span class="dv">1</span>, <span class="dv">5</span>, <span class="dv">10</span>, <span class="dv">19</span>, <span class="dv">22</span>, <span class="dv">23</span>, <span class="dv">30</span></span></code></pre></div> <p>Suppose we’d like to find the matching position of the value <em>1</em>, using binary search, this is how we would proceed - because we know that the data is <em>sorted</em>.</p> <ul> <li><p>Start with the middle value = 19. Is 1 == 19? No. 1 < 19.</p></li> <li><p>Since the value we’re looking for is smaller than 19, it should be somewhere before 19. So we can discard the rest of the half that are >= 19.</p></li> <li><p>Our set is now reduced to <em>1, 5, 10</em>. Grab the middle value once again = 5. Is 1 == 5? No. 1 < 5.</p></li> <li><p>Our set is reduced to <em>1</em>. Is 1 == 1? Yes. The corresponding index is also 1. And that’s the only match.</p></li> </ul> <p>A vector scan approach on the other hand would have to scan through all the values (here, 7).</p> <p>It can be seen that with every search we reduce the number of searches by half. This is why <em>binary search</em> based subsets are <strong>incredibly fast</strong>. Since rows of each column of <em>data.tables</em> have contiguous locations in memory, the operations are performed in a very cache efficient manner (also contributes to <em>speed</em>).</p> <p>In addition, since we obtain the matching row indices directly without having to create those huge logical vectors (equal to the number of rows in a <em>data.table</em>), it is quite <strong>memory efficient</strong> as well.</p> </div> </div> </div> <div id="summary" class="section level2"> <h2>Summary</h2> <p>In this vignette, we have learnt another method to subset rows in <code>i</code> by keying a <em>data.table</em>. Setting keys allows us to perform blazing fast subsets by using <em>binary search</em>. In particular, we have seen how to</p> <ul> <li><p>set key and subset using the key on a <em>data.table</em>.</p></li> <li><p>subset using keys which fetches <em>row indices</em> in <code>i</code>, but much faster.</p></li> <li><p>combine key based subsets with <code>j</code> and <code>by</code>. Note that the <code>j</code> and <code>by</code> operations are exactly the same as before.</p></li> </ul> <p>Key based subsets are <strong>incredibly fast</strong> and are particularly useful when the task involves <em>repeated subsetting</em>. But it may not be always desirable to set key and physically reorder the <em>data.table</em>. In the next vignette, we will address this using a <em>new</em> feature – <em>secondary indexes</em>.</p> </div> <!-- code folding --> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>