EVOLUTION-MANAGER
Edit File: dbplyr.html
<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Introduction to dbplyr</title> <script>// Pandoc 2.9 adds attributes on both header and div. We remove the former (to // be compatible with the behavior of Pandoc < 2.8). document.addEventListener('DOMContentLoaded', function(e) { var hs = document.querySelectorAll("div.section[class*='level'] > :first-child"); var i, h, a; for (i = 0; i < hs.length; i++) { h = hs[i]; if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6 a = h.attributes; while (a.length > 0) h.removeAttribute(a[0].name); } }); </script> <style type="text/css">code{white-space: pre;}</style> <style type="text/css" data-origin="pandoc"> pre > code.sourceCode { white-space: pre; position: relative; } pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } code.sourceCode > span { color: inherit; text-decoration: inherit; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { pre > code.sourceCode { white-space: pre-wrap; } pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } for (var j = 0; j < rules.length; j++) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") continue; var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' && rule.style.backgroundColor === '') continue; // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> <style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">Introduction to dbplyr</h1> <p>As well as working with local in-memory data stored in data frames, dplyr also works with remote on-disk data stored in databases. This is particularly useful in two scenarios:</p> <ul> <li><p>Your data is already in a database.</p></li> <li><p>You have so much data that it does not all fit into memory simultaneously and you need to use some external storage engine.</p></li> </ul> <p>(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating.)</p> <p>This vignette focuses on the first scenario because it’s the most common. If you’re using R to do data analysis inside a company, most of the data you need probably already lives in a database (it’s just a matter of figuring out which one!). However, you will learn how to load data in to a local database in order to demonstrate dplyr’s database tools. At the end, I’ll also give you a few pointers if you do need to set up your own database.</p> <div id="getting-started" class="section level2"> <h2>Getting started</h2> <p>To use databases with dplyr you need to first install dbplyr:</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">install.packages</span>(<span class="st">"dbplyr"</span>)</span></code></pre></div> <p>You’ll also need to install a DBI backend package. The DBI package provides a common interface that allows dplyr to work with many different databases using the same code. DBI is automatically installed with dbplyr, but you need to install a specific backend for the database that you want to connect to.</p> <p>Five commonly used backends are:</p> <ul> <li><p><a href="https://CRAN.R-project.org/package=RMariaDB">RMariaDB</a> connects to MySQL and MariaDB</p></li> <li><p><a href="https://CRAN.R-project.org/package=RPostgres">RPostgres</a> connects to Postgres and Redshift.</p></li> <li><p><a href="https://github.com/rstats-db/RSQLite">RSQLite</a> embeds a SQLite database.</p></li> <li><p><a href="https://github.com/rstats-db/odbc#odbc">odbc</a> connects to many commercial databases via the open database connectivity protocol.</p></li> <li><p><a href="https://github.com/rstats-db/bigrquery">bigrquery</a> connects to Google’s BigQuery.</p></li> </ul> <p>If the database you need to connect to is not listed here, you’ll need to do some investigation (i.e. googling) yourself.</p> <p>In this vignette, we’re going to use the RSQLite backend which is automatically installed when you install dbplyr. SQLite is a great way to get started with databases because it’s completely embedded inside an R package. Unlike most other systems, you don’t need to setup a separate database server. SQLite is great for demos, but is surprisingly powerful, and with a little practice you can use it to easily work with many gigabytes of data.</p> </div> <div id="connecting-to-the-database" class="section level2"> <h2>Connecting to the database</h2> <p>To work with a database in dplyr, you must first connect to it, using <code>DBI::dbConnect()</code>. We’re not going to go into the details of the DBI package here, but it’s the foundation upon which dbplyr is built. You’ll need to learn more about if you need to do things to the database that are beyond the scope of dplyr.</p> <div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">library</span>(dplyr)</span> <span id="cb2-2"><a href="#cb2-2"></a>con <-<span class="st"> </span>DBI<span class="op">::</span><span class="kw">dbConnect</span>(RSQLite<span class="op">::</span><span class="kw">SQLite</span>(), <span class="dt">dbname =</span> <span class="st">":memory:"</span>)</span></code></pre></div> <p>The arguments to <code>DBI::dbConnect()</code> vary from database to database, but the first argument is always the database backend. It’s <code>RSQLite::SQLite()</code> for RSQLite, <code>RMariaDB::MariaDB()</code> for RMariaDB, <code>RPostgres::Postgres()</code> for RPostgres, <code>odbc::odbc()</code> for odbc, and <code>bigrquery::bigquery()</code> for BigQuery. SQLite only needs one other argument: the path to the database. Here we use the special string <code>":memory:"</code> which causes SQLite to make a temporary in-memory database.</p> <p>Most existing databases don’t live in a file, but instead live on another server. That means in real-life that your code will look more like this:</p> <div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1"></a>con <-<span class="st"> </span>DBI<span class="op">::</span><span class="kw">dbConnect</span>(RMariaDB<span class="op">::</span><span class="kw">MariaDB</span>(), </span> <span id="cb3-2"><a href="#cb3-2"></a> <span class="dt">host =</span> <span class="st">"database.rstudio.com"</span>,</span> <span id="cb3-3"><a href="#cb3-3"></a> <span class="dt">user =</span> <span class="st">"hadley"</span>,</span> <span id="cb3-4"><a href="#cb3-4"></a> <span class="dt">password =</span> rstudioapi<span class="op">::</span><span class="kw">askForPassword</span>(<span class="st">"Database password"</span>)</span> <span id="cb3-5"><a href="#cb3-5"></a>)</span></code></pre></div> <p>(If you’re not using RStudio, you’ll need some other way to securely retrieve your password. You should never record it in your analysis scripts or type it into the console. <a href="https://db.rstudio.com/best-practices/managing-credentials">Securing Credentials</a> provides some best practices.)</p> <p>Our temporary database has no data in it, so we’ll start by copying over <code>nycflights13::flights</code> using the convenient <code>copy_to()</code> function. This is a quick and dirty way of getting data into a database and is useful primarily for demos and other small jobs.</p> <div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1"></a><span class="kw">copy_to</span>(con, nycflights13<span class="op">::</span>flights, <span class="st">"flights"</span>,</span> <span id="cb4-2"><a href="#cb4-2"></a> <span class="dt">temporary =</span> <span class="ot">FALSE</span>, </span> <span id="cb4-3"><a href="#cb4-3"></a> <span class="dt">indexes =</span> <span class="kw">list</span>(</span> <span id="cb4-4"><a href="#cb4-4"></a> <span class="kw">c</span>(<span class="st">"year"</span>, <span class="st">"month"</span>, <span class="st">"day"</span>), </span> <span id="cb4-5"><a href="#cb4-5"></a> <span class="st">"carrier"</span>, </span> <span id="cb4-6"><a href="#cb4-6"></a> <span class="st">"tailnum"</span>,</span> <span id="cb4-7"><a href="#cb4-7"></a> <span class="st">"dest"</span></span> <span id="cb4-8"><a href="#cb4-8"></a> )</span> <span id="cb4-9"><a href="#cb4-9"></a>)</span></code></pre></div> <p>As you can see, the <code>copy_to()</code> operation has an additional argument that allows you to supply indexes for the table. Here we set up indexes that will allow us to quickly process the data by day, carrier, plane, and destination. Creating the right indices is key to good database performance, but is unfortunately beyond the scope of this article.</p> <p>Now that we’ve copied the data, we can use <code>tbl()</code> to take a reference to it:</p> <div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1"></a>flights_db <-<span class="st"> </span><span class="kw">tbl</span>(con, <span class="st">"flights"</span>)</span></code></pre></div> <p>When you print it out, you’ll notice that it mostly looks like a regular tibble:</p> <div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1"></a>flights_db </span> <span id="cb6-2"><a href="#cb6-2"></a><span class="co">#> # Source: table<flights> [?? x 19]</span></span> <span id="cb6-3"><a href="#cb6-3"></a><span class="co">#> # Database: sqlite 3.30.1 [:memory:]</span></span> <span id="cb6-4"><a href="#cb6-4"></a><span class="co">#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time</span></span> <span id="cb6-5"><a href="#cb6-5"></a><span class="co">#> <int> <int> <int> <int> <int> <dbl> <int> <int></span></span> <span id="cb6-6"><a href="#cb6-6"></a><span class="co">#> 1 2013 1 1 517 515 2 830 819</span></span> <span id="cb6-7"><a href="#cb6-7"></a><span class="co">#> 2 2013 1 1 533 529 4 850 830</span></span> <span id="cb6-8"><a href="#cb6-8"></a><span class="co">#> 3 2013 1 1 542 540 2 923 850</span></span> <span id="cb6-9"><a href="#cb6-9"></a><span class="co">#> 4 2013 1 1 544 545 -1 1004 1022</span></span> <span id="cb6-10"><a href="#cb6-10"></a><span class="co">#> 5 2013 1 1 554 600 -6 812 837</span></span> <span id="cb6-11"><a href="#cb6-11"></a><span class="co">#> 6 2013 1 1 554 558 -4 740 728</span></span> <span id="cb6-12"><a href="#cb6-12"></a><span class="co">#> # … with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,</span></span> <span id="cb6-13"><a href="#cb6-13"></a><span class="co">#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,</span></span> <span id="cb6-14"><a href="#cb6-14"></a><span class="co">#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl></span></span></code></pre></div> <p>The main difference is that you can see that it’s a remote source in a SQLite database.</p> </div> <div id="generating-queries" class="section level2"> <h2>Generating queries</h2> <p>To interact with a database you usually use SQL, the Structured Query Language. SQL is over 40 years old, and is used by pretty much every database in existence. The goal of dbplyr is to automatically generate SQL for you so that you’re not forced to use it. However, SQL is a very large language and dbplyr doesn’t do everything. It focusses on <code>SELECT</code> statements, the SQL you write most often as an analyst.</p> <p>Most of the time you don’t need to know anything about SQL, and you can continue to use the dplyr verbs that you’re already familiar with:</p> <div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1"></a>flights_db <span class="op">%>%</span><span class="st"> </span><span class="kw">select</span>(year<span class="op">:</span>day, dep_delay, arr_delay)</span> <span id="cb7-2"><a href="#cb7-2"></a><span class="co">#> # Source: lazy query [?? x 5]</span></span> <span id="cb7-3"><a href="#cb7-3"></a><span class="co">#> # Database: sqlite 3.30.1 [:memory:]</span></span> <span id="cb7-4"><a href="#cb7-4"></a><span class="co">#> year month day dep_delay arr_delay</span></span> <span id="cb7-5"><a href="#cb7-5"></a><span class="co">#> <int> <int> <int> <dbl> <dbl></span></span> <span id="cb7-6"><a href="#cb7-6"></a><span class="co">#> 1 2013 1 1 2 11</span></span> <span id="cb7-7"><a href="#cb7-7"></a><span class="co">#> 2 2013 1 1 4 20</span></span> <span id="cb7-8"><a href="#cb7-8"></a><span class="co">#> 3 2013 1 1 2 33</span></span> <span id="cb7-9"><a href="#cb7-9"></a><span class="co">#> 4 2013 1 1 -1 -18</span></span> <span id="cb7-10"><a href="#cb7-10"></a><span class="co">#> 5 2013 1 1 -6 -25</span></span> <span id="cb7-11"><a href="#cb7-11"></a><span class="co">#> 6 2013 1 1 -4 12</span></span> <span id="cb7-12"><a href="#cb7-12"></a><span class="co">#> # … with more rows</span></span> <span id="cb7-13"><a href="#cb7-13"></a></span> <span id="cb7-14"><a href="#cb7-14"></a>flights_db <span class="op">%>%</span><span class="st"> </span><span class="kw">filter</span>(dep_delay <span class="op">></span><span class="st"> </span><span class="dv">240</span>)</span> <span id="cb7-15"><a href="#cb7-15"></a><span class="co">#> # Source: lazy query [?? x 19]</span></span> <span id="cb7-16"><a href="#cb7-16"></a><span class="co">#> # Database: sqlite 3.30.1 [:memory:]</span></span> <span id="cb7-17"><a href="#cb7-17"></a><span class="co">#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time</span></span> <span id="cb7-18"><a href="#cb7-18"></a><span class="co">#> <int> <int> <int> <int> <int> <dbl> <int> <int></span></span> <span id="cb7-19"><a href="#cb7-19"></a><span class="co">#> 1 2013 1 1 848 1835 853 1001 1950</span></span> <span id="cb7-20"><a href="#cb7-20"></a><span class="co">#> 2 2013 1 1 1815 1325 290 2120 1542</span></span> <span id="cb7-21"><a href="#cb7-21"></a><span class="co">#> 3 2013 1 1 1842 1422 260 1958 1535</span></span> <span id="cb7-22"><a href="#cb7-22"></a><span class="co">#> 4 2013 1 1 2115 1700 255 2330 1920</span></span> <span id="cb7-23"><a href="#cb7-23"></a><span class="co">#> 5 2013 1 1 2205 1720 285 46 2040</span></span> <span id="cb7-24"><a href="#cb7-24"></a><span class="co">#> 6 2013 1 1 2343 1724 379 314 1938</span></span> <span id="cb7-25"><a href="#cb7-25"></a><span class="co">#> # … with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,</span></span> <span id="cb7-26"><a href="#cb7-26"></a><span class="co">#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,</span></span> <span id="cb7-27"><a href="#cb7-27"></a><span class="co">#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl></span></span> <span id="cb7-28"><a href="#cb7-28"></a></span> <span id="cb7-29"><a href="#cb7-29"></a>flights_db <span class="op">%>%</span><span class="st"> </span></span> <span id="cb7-30"><a href="#cb7-30"></a><span class="st"> </span><span class="kw">group_by</span>(dest) <span class="op">%>%</span></span> <span id="cb7-31"><a href="#cb7-31"></a><span class="st"> </span><span class="kw">summarise</span>(<span class="dt">delay =</span> <span class="kw">mean</span>(dep_time))</span> <span id="cb7-32"><a href="#cb7-32"></a><span class="co">#> Warning: Missing values are always removed in SQL.</span></span> <span id="cb7-33"><a href="#cb7-33"></a><span class="co">#> Use `mean(x, na.rm = TRUE)` to silence this warning</span></span> <span id="cb7-34"><a href="#cb7-34"></a><span class="co">#> This warning is displayed only once per session.</span></span> <span id="cb7-35"><a href="#cb7-35"></a><span class="co">#> # Source: lazy query [?? x 2]</span></span> <span id="cb7-36"><a href="#cb7-36"></a><span class="co">#> # Database: sqlite 3.30.1 [:memory:]</span></span> <span id="cb7-37"><a href="#cb7-37"></a><span class="co">#> dest delay</span></span> <span id="cb7-38"><a href="#cb7-38"></a><span class="co">#> <chr> <dbl></span></span> <span id="cb7-39"><a href="#cb7-39"></a><span class="co">#> 1 ABQ 2006.</span></span> <span id="cb7-40"><a href="#cb7-40"></a><span class="co">#> 2 ACK 1033.</span></span> <span id="cb7-41"><a href="#cb7-41"></a><span class="co">#> 3 ALB 1627.</span></span> <span id="cb7-42"><a href="#cb7-42"></a><span class="co">#> 4 ANC 1635.</span></span> <span id="cb7-43"><a href="#cb7-43"></a><span class="co">#> 5 ATL 1293.</span></span> <span id="cb7-44"><a href="#cb7-44"></a><span class="co">#> 6 AUS 1521.</span></span> <span id="cb7-45"><a href="#cb7-45"></a><span class="co">#> # … with more rows</span></span></code></pre></div> <p>However, in the long-run, I highly recommend you at least learn the basics of SQL. It’s a valuable skill for any data scientist, and it will help you debug problems if you run into problems with dplyr’s automatic translation. If you’re completely new to SQL you might start with this <a href="https://www.codecademy.com/learn/learn-sql">codeacademy tutorial</a>. If you have some familiarity with SQL and you’d like to learn more, I found <a href="http://www.sqlite.org/queryplanner.html">how indexes work in SQLite</a> and <a href="http://blog.jooq.org/2016/03/17/10-easy-steps-to-a-complete-understanding-of-sql">10 easy steps to a complete understanding of SQL</a> to be particularly helpful.</p> <p>The most important difference between ordinary data frames and remote database queries is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine. When working with databases, dplyr tries to be as lazy as possible:</p> <ul> <li><p>It never pulls data into R unless you explicitly ask for it.</p></li> <li><p>It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.</p></li> </ul> <p>For example, take the following code:</p> <div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1"></a>tailnum_delay_db <-<span class="st"> </span>flights_db <span class="op">%>%</span><span class="st"> </span></span> <span id="cb8-2"><a href="#cb8-2"></a><span class="st"> </span><span class="kw">group_by</span>(tailnum) <span class="op">%>%</span></span> <span id="cb8-3"><a href="#cb8-3"></a><span class="st"> </span><span class="kw">summarise</span>(</span> <span id="cb8-4"><a href="#cb8-4"></a> <span class="dt">delay =</span> <span class="kw">mean</span>(arr_delay),</span> <span id="cb8-5"><a href="#cb8-5"></a> <span class="dt">n =</span> <span class="kw">n</span>()</span> <span id="cb8-6"><a href="#cb8-6"></a> ) <span class="op">%>%</span><span class="st"> </span></span> <span id="cb8-7"><a href="#cb8-7"></a><span class="st"> </span><span class="kw">arrange</span>(<span class="kw">desc</span>(delay)) <span class="op">%>%</span></span> <span id="cb8-8"><a href="#cb8-8"></a><span class="st"> </span><span class="kw">filter</span>(n <span class="op">></span><span class="st"> </span><span class="dv">100</span>)</span></code></pre></div> <p>Surprisingly, this sequence of operations never touches the database. It’s not until you ask for the data (e.g. by printing <code>tailnum_delay</code>) that dplyr generates the SQL and requests the results from the database. Even then it tries to do as little work as possible and only pulls down a few rows.</p> <div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1"></a>tailnum_delay_db</span> <span id="cb9-2"><a href="#cb9-2"></a><span class="co">#> # Source: lazy query [?? x 3]</span></span> <span id="cb9-3"><a href="#cb9-3"></a><span class="co">#> # Database: sqlite 3.30.1 [:memory:]</span></span> <span id="cb9-4"><a href="#cb9-4"></a><span class="co">#> # Ordered by: desc(delay)</span></span> <span id="cb9-5"><a href="#cb9-5"></a><span class="co">#> tailnum delay n</span></span> <span id="cb9-6"><a href="#cb9-6"></a><span class="co">#> <chr> <dbl> <int></span></span> <span id="cb9-7"><a href="#cb9-7"></a><span class="co">#> 1 N11119 30.3 148</span></span> <span id="cb9-8"><a href="#cb9-8"></a><span class="co">#> 2 N16919 29.9 251</span></span> <span id="cb9-9"><a href="#cb9-9"></a><span class="co">#> 3 N14998 27.9 230</span></span> <span id="cb9-10"><a href="#cb9-10"></a><span class="co">#> 4 N15910 27.6 280</span></span> <span id="cb9-11"><a href="#cb9-11"></a><span class="co">#> 5 N13123 26.0 121</span></span> <span id="cb9-12"><a href="#cb9-12"></a><span class="co">#> 6 N11192 25.9 154</span></span> <span id="cb9-13"><a href="#cb9-13"></a><span class="co">#> # … with more rows</span></span></code></pre></div> <p>Behind the scenes, dplyr is translating your R code into SQL. You can see the SQL it’s generating with <code>show_query()</code>:</p> <div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1"></a>tailnum_delay_db <span class="op">%>%</span><span class="st"> </span><span class="kw">show_query</span>()</span> <span id="cb10-2"><a href="#cb10-2"></a><span class="co">#> <SQL></span></span> <span id="cb10-3"><a href="#cb10-3"></a><span class="co">#> SELECT *</span></span> <span id="cb10-4"><a href="#cb10-4"></a><span class="co">#> FROM (SELECT *</span></span> <span id="cb10-5"><a href="#cb10-5"></a><span class="co">#> FROM (SELECT `tailnum`, AVG(`arr_delay`) AS `delay`, COUNT() AS `n`</span></span> <span id="cb10-6"><a href="#cb10-6"></a><span class="co">#> FROM `flights`</span></span> <span id="cb10-7"><a href="#cb10-7"></a><span class="co">#> GROUP BY `tailnum`)</span></span> <span id="cb10-8"><a href="#cb10-8"></a><span class="co">#> ORDER BY `delay` DESC)</span></span> <span id="cb10-9"><a href="#cb10-9"></a><span class="co">#> WHERE (`n` > 100.0)</span></span></code></pre></div> <p>If you’re familiar with SQL, this probably isn’t exactly what you’d write by hand, but it does the job. You can learn more about the SQL translation in <code>vignette("translation-verb")</code> and <code>vignette("translation-function")</code>.</p> <p>Typically, you’ll iterate a few times before you figure out what data you need from the database. Once you’ve figured it out, use <code>collect()</code> to pull all the data down into a local tibble:</p> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1"></a>tailnum_delay <-<span class="st"> </span>tailnum_delay_db <span class="op">%>%</span><span class="st"> </span><span class="kw">collect</span>()</span> <span id="cb11-2"><a href="#cb11-2"></a>tailnum_delay</span> <span id="cb11-3"><a href="#cb11-3"></a><span class="co">#> # A tibble: 1,201 x 3</span></span> <span id="cb11-4"><a href="#cb11-4"></a><span class="co">#> tailnum delay n</span></span> <span id="cb11-5"><a href="#cb11-5"></a><span class="co">#> <chr> <dbl> <int></span></span> <span id="cb11-6"><a href="#cb11-6"></a><span class="co">#> 1 N11119 30.3 148</span></span> <span id="cb11-7"><a href="#cb11-7"></a><span class="co">#> 2 N16919 29.9 251</span></span> <span id="cb11-8"><a href="#cb11-8"></a><span class="co">#> 3 N14998 27.9 230</span></span> <span id="cb11-9"><a href="#cb11-9"></a><span class="co">#> 4 N15910 27.6 280</span></span> <span id="cb11-10"><a href="#cb11-10"></a><span class="co">#> 5 N13123 26.0 121</span></span> <span id="cb11-11"><a href="#cb11-11"></a><span class="co">#> 6 N11192 25.9 154</span></span> <span id="cb11-12"><a href="#cb11-12"></a><span class="co">#> # … with 1,195 more rows</span></span></code></pre></div> <p><code>collect()</code> requires that database does some work, so it may take a long time to complete. Otherwise, dplyr tries to prevent you from accidentally performing expensive query operations:</p> <ul> <li><p>Because there’s generally no way to determine how many rows a query will return unless you actually run it, <code>nrow()</code> is always <code>NA</code>.</p></li> <li><p>Because you can’t find the last few rows without executing the whole query, you can’t use <code>tail()</code>.</p></li> </ul> <div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1"></a><span class="kw">nrow</span>(tailnum_delay_db)</span> <span id="cb12-2"><a href="#cb12-2"></a><span class="co">#> [1] NA</span></span> <span id="cb12-3"><a href="#cb12-3"></a></span> <span id="cb12-4"><a href="#cb12-4"></a><span class="kw">tail</span>(tailnum_delay_db)</span> <span id="cb12-5"><a href="#cb12-5"></a><span class="co">#> Error: tail() is not supported by sql sources</span></span></code></pre></div> <p>You can also ask the database how it plans to execute the query with <code>explain()</code>. The output is database dependent, and can be esoteric, but learning a bit about it can be very useful because it helps you understand if the database can execute the query efficiently, or if you need to create new indices.</p> </div> <div id="creating-your-own-database" class="section level2"> <h2>Creating your own database</h2> <p>If you don’t already have a database, here’s some advice from my experiences setting up and running all of them. SQLite is by far the easiest to get started with. PostgreSQL is not too much harder to use and has a wide range of built-in functions. In my opinion, you shouldn’t bother with MySQL/MariaDB: it’s a pain to set up, the documentation is subpar, and it’s less featureful than Postgres. Google BigQuery might be a good fit if you have very large data, or if you’re willing to pay (a small amount of) money to someone who’ll look after your database.</p> <p>All of these databases follow a client-server model - a computer that connects to the database and the computer that is running the database (the two may be one and the same but usually isn’t). Getting one of these databases up and running is beyond the scope of this article, but there are plenty of tutorials available on the web.</p> <div id="mysqlmariadb" class="section level3"> <h3>MySQL/MariaDB</h3> <p>In terms of functionality, MySQL lies somewhere between SQLite and PostgreSQL. It provides a wider range of <a href="http://dev.mysql.com/doc/refman/5.0/en/functions.html">built-in functions</a>. It gained support for window functions in 2018.</p> </div> <div id="postgresql" class="section level3"> <h3>PostgreSQL</h3> <p>PostgreSQL is a considerably more powerful database than SQLite. It has a much wider range of <a href="http://www.postgresql.org/docs/current/static/functions.html">built-in functions</a>, and is generally a more featureful database.</p> </div> <div id="bigquery" class="section level3"> <h3>BigQuery</h3> <p>BigQuery is a hosted database server provided by Google. To connect, you need to provide your <code>project</code>, <code>dataset</code> and optionally a project for <code>billing</code> (if billing for <code>project</code> isn’t enabled).</p> <p>It provides a similar set of functions to Postgres and is designed specifically for analytic workflows. Because it’s a hosted solution, there’s no setup involved, but if you have a lot of data, getting it to Google can be an ordeal (especially because upload support from R is not great currently). (If you have lots of data, you can <a href="https://cloud.google.com/storage/docs/offline-media-import-export">ship hard drives</a>!)</p> </div> </div> <!-- code folding --> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>