<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <meta name="date" content="2020-06-25" /> <title>broom and dplyr</title> <script>// Hide empty <a> tag within highlighted CodeBlock for screen reader accessibility (see https://github.com/jgm/pandoc/issues/6352#issuecomment-626106786) --> // v0.0.1 // Written by JooYoung Seo (jooyoung@psu.edu) and Atsushi Yasumoto on June 1st, 2020. document.addEventListener('DOMContentLoaded', function() { const codeList = document.getElementsByClassName("sourceCode"); for (var i = 0; i < codeList.length; i++) { var linkList = codeList[i].getElementsByTagName('a'); for (var j = 0; j < linkList.length; j++) { if (linkList[j].innerHTML === "") { linkList[j].setAttribute('aria-hidden', 'true'); } } } }); </script> <style type="text/css">code{white-space: pre;}</style> <style type="text/css" data-origin="pandoc"> code.sourceCode > span { display: inline-block; line-height: 1.25; } code.sourceCode > span { color: inherit; text-decoration: inherit; } code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode { white-space: pre; position: relative; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { code.sourceCode { white-space: pre-wrap; } code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; 
-moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media screen { code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to 
pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } for (var j = 0; j < rules.length; j++) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") continue; var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' && rule.style.backgroundColor === '') continue; // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> <style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 
0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">broom and dplyr</h1> <h4 class="date">2020-06-25</h4> <div id="broom-and-dplyr" class="section level1"> <h1>broom and dplyr</h1> 
<p>While broom is useful for summarizing the result of a single analysis in a consistent format, it is really designed for high-throughput applications, where you must combine results from multiple analyses. These could be subgroups of data, analyses using different models, bootstrap replicates, permutations, and so on. In particular, it plays well with the <code>nest/unnest</code> functions in <code>tidyr</code> and the <code>map</code> function in <code>purrr</code>. First, loading necessary packages and setting some defaults:</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">library</span>(broom)</span> <span id="cb1-2"><a href="#cb1-2"></a><span class="kw">library</span>(tibble)</span> <span id="cb1-3"><a href="#cb1-3"></a><span class="kw">library</span>(ggplot2)</span> <span id="cb1-4"><a href="#cb1-4"></a><span class="kw">library</span>(dplyr)</span> <span id="cb1-5"><a href="#cb1-5"></a><span class="kw">library</span>(tidyr)</span> <span id="cb1-6"><a href="#cb1-6"></a><span class="kw">library</span>(purrr)</span> <span id="cb1-7"><a href="#cb1-7"></a></span> <span id="cb1-8"><a href="#cb1-8"></a><span class="kw">theme_set</span>(<span class="kw">theme_minimal</span>())</span></code></pre></div> <p>Let’s try this on a simple dataset, the built-in <code>Orange</code>. We start by coercing <code>Orange</code> to a <code>tibble</code>. 
This gives a nicer print method that will be especially useful later on when we start working with list-columns.</p> <div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">data</span>(Orange)</span> <span id="cb2-2"><a href="#cb2-2"></a></span> <span id="cb2-3"><a href="#cb2-3"></a>Orange <-<span class="st"> </span><span class="kw">as_tibble</span>(Orange)</span> <span id="cb2-4"><a href="#cb2-4"></a>Orange</span></code></pre></div> <pre><code>## # A tibble: 35 x 3 ## Tree age circumference ## <ord> <dbl> <dbl> ## 1 1 118 30 ## 2 1 484 58 ## 3 1 664 87 ## 4 1 1004 115 ## 5 1 1231 120 ## 6 1 1372 142 ## 7 1 1582 145 ## 8 2 118 33 ## 9 2 484 69 ## 10 2 664 111 ## # … with 25 more rows</code></pre> <p>This contains 35 observations of three variables: <code>Tree</code>, <code>age</code>, and <code>circumference</code>. <code>Tree</code> is a factor with five levels describing five trees. As might be expected, age and circumference are correlated:</p> <div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1"></a><span class="kw">cor</span>(Orange<span class="op">$</span>age, Orange<span class="op">$</span>circumference)</span></code></pre></div> <pre><code>## [1] 0.9135189</code></pre> <div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1"></a><span class="kw">ggplot</span>(Orange, <span class="kw">aes</span>(age, circumference, <span class="dt">color =</span> Tree)) <span class="op">+</span></span> <span id="cb6-2"><a href="#cb6-2"></a><span class="st"> </span><span class="kw">geom_line</span>()</span></code></pre></div> <p><img src="" /><!-- --></p> <p>Suppose you want to test for correlations individually <em>within</em> each tree. 
You can do this with dplyr’s <code>group_by</code>:</p> <div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1"></a>Orange <span class="op">%>%</span><span class="st"> </span></span> <span id="cb7-2"><a href="#cb7-2"></a><span class="st"> </span><span class="kw">group_by</span>(Tree) <span class="op">%>%</span></span> <span id="cb7-3"><a href="#cb7-3"></a><span class="st"> </span><span class="kw">summarize</span>(<span class="dt">correlation =</span> <span class="kw">cor</span>(age, circumference))</span></code></pre></div> <pre><code>## # A tibble: 5 x 2 ## Tree correlation ## <ord> <dbl> ## 1 3 0.988 ## 2 1 0.985 ## 3 5 0.988 ## 4 2 0.987 ## 5 4 0.984</code></pre> <p>(Note that the within-tree correlations are much higher than the aggregated one, and that they are similar across trees.)</p> <p>Suppose that instead of simply estimating a correlation, we want to perform a hypothesis test with <code>cor.test</code>:</p> <div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1"></a>ct <-<span class="st"> </span><span class="kw">cor.test</span>(Orange<span class="op">$</span>age, Orange<span class="op">$</span>circumference)</span> <span id="cb9-2"><a href="#cb9-2"></a>ct</span></code></pre></div> <pre><code>## ## Pearson's product-moment correlation ## ## data: Orange$age and Orange$circumference ## t = 12.9, df = 33, p-value = 1.931e-14 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.8342364 0.9557955 ## sample estimates: ## cor ## 0.9135189</code></pre> <p>This contains multiple values we could want in our output. Some are vectors of length 1, such as the p-value and the estimate, and some are longer, such as the confidence interval. 
We can get this into a nicely organized tibble using the <code>tidy</code> function:</p> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1"></a><span class="kw">tidy</span>(ct)</span></code></pre></div> <pre><code>## # A tibble: 1 x 8 ## estimate statistic p.value parameter conf.low conf.high method alternative ## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr> ## 1 0.914 12.9 1.93e-14 33 0.834 0.956 Pearson'… two.sided</code></pre> <p>Often, we want to perform multiple tests or fit multiple models, each on a different part of the data. In this case, we recommend a <code>nest-map-unnest</code> workflow. For example, suppose we want to perform correlation tests for each different tree. We start by <code>nest</code>ing our data based on the group of interest:</p> <div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1"></a>nested <-<span class="st"> </span>Orange <span class="op">%>%</span><span class="st"> </span></span> <span id="cb13-2"><a href="#cb13-2"></a><span class="st"> </span><span class="kw">nest</span>(<span class="dt">data =</span> <span class="op">-</span>Tree)</span></code></pre></div> <p>Then we run a correlation test for each nested tibble using <code>purrr::map</code>:</p> <div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1"></a>nested <span class="op">%>%</span><span class="st"> </span></span> <span id="cb14-2"><a href="#cb14-2"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="dt">test =</span> <span class="kw">map</span>(data, <span class="op">~</span><span class="st"> </span><span class="kw">cor.test</span>(.x<span class="op">$</span>age, .x<span class="op">$</span>circumference)))</span></code></pre></div> <pre><code>## # A tibble: 5 x 3 ## Tree data test ## <ord> <list> <list> ## 1 1 <tibble [7 × 2]> <htest> ## 2 2 
<tibble [7 × 2]> <htest> ## 3 3 <tibble [7 × 2]> <htest> ## 4 4 <tibble [7 × 2]> <htest> ## 5 5 <tibble [7 × 2]> <htest></code></pre> <p>This results in a list-column of S3 objects. We want to tidy each of the objects, which we can also do with <code>map</code>.</p> <div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb16-1"><a href="#cb16-1"></a>nested <span class="op">%>%</span><span class="st"> </span></span> <span id="cb16-2"><a href="#cb16-2"></a><span class="st"> </span><span class="kw">mutate</span>(</span> <span id="cb16-3"><a href="#cb16-3"></a> <span class="dt">test =</span> <span class="kw">map</span>(data, <span class="op">~</span><span class="st"> </span><span class="kw">cor.test</span>(.x<span class="op">$</span>age, .x<span class="op">$</span>circumference)), <span class="co"># S3 list-col</span></span> <span id="cb16-4"><a href="#cb16-4"></a> <span class="dt">tidied =</span> <span class="kw">map</span>(test, tidy)</span> <span id="cb16-5"><a href="#cb16-5"></a> ) </span></code></pre></div> <pre><code>## # A tibble: 5 x 4 ## Tree data test tidied ## <ord> <list> <list> <list> ## 1 1 <tibble [7 × 2]> <htest> <tibble [1 × 8]> ## 2 2 <tibble [7 × 2]> <htest> <tibble [1 × 8]> ## 3 3 <tibble [7 × 2]> <htest> <tibble [1 × 8]> ## 4 4 <tibble [7 × 2]> <htest> <tibble [1 × 8]> ## 5 5 <tibble [7 × 2]> <htest> <tibble [1 × 8]></code></pre> <p>Finally, we want to unnest the tidied data frames so we can see the results in a flat tibble. 
All together, this looks like:</p> <div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb18-1"><a href="#cb18-1"></a>Orange <span class="op">%>%</span><span class="st"> </span></span> <span id="cb18-2"><a href="#cb18-2"></a><span class="st"> </span><span class="kw">nest</span>(<span class="dt">data =</span> <span class="op">-</span>Tree) <span class="op">%>%</span><span class="st"> </span></span> <span id="cb18-3"><a href="#cb18-3"></a><span class="st"> </span><span class="kw">mutate</span>(</span> <span id="cb18-4"><a href="#cb18-4"></a> <span class="dt">test =</span> <span class="kw">map</span>(data, <span class="op">~</span><span class="st"> </span><span class="kw">cor.test</span>(.x<span class="op">$</span>age, .x<span class="op">$</span>circumference)), <span class="co"># S3 list-col</span></span> <span id="cb18-5"><a href="#cb18-5"></a> <span class="dt">tidied =</span> <span class="kw">map</span>(test, tidy)</span> <span id="cb18-6"><a href="#cb18-6"></a> ) <span class="op">%>%</span><span class="st"> </span></span> <span id="cb18-7"><a href="#cb18-7"></a><span class="st"> </span><span class="kw">unnest</span>(tidied)</span></code></pre></div> <pre><code>## # A tibble: 5 x 11 ## Tree data test estimate statistic p.value parameter conf.low conf.high ## <ord> <lis> <lis> <dbl> <dbl> <dbl> <int> <dbl> <dbl> ## 1 1 <tib… <hte… 0.985 13.0 4.85e-5 5 0.901 0.998 ## 2 2 <tib… <hte… 0.987 13.9 3.43e-5 5 0.914 0.998 ## 3 3 <tib… <hte… 0.988 14.4 2.90e-5 5 0.919 0.998 ## 4 4 <tib… <hte… 0.984 12.5 5.73e-5 5 0.895 0.998 ## 5 5 <tib… <hte… 0.988 14.1 3.18e-5 5 0.916 0.998 ## # … with 2 more variables: method <chr>, alternative <chr></code></pre> <p>This workflow becomes even more useful when applied to regressions. 
Untidy output for a regression looks like:</p> <div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb20-1"><a href="#cb20-1"></a>lm_fit <-<span class="st"> </span><span class="kw">lm</span>(age <span class="op">~</span><span class="st"> </span>circumference, <span class="dt">data =</span> Orange)</span> <span id="cb20-2"><a href="#cb20-2"></a><span class="kw">summary</span>(lm_fit)</span></code></pre></div> <pre><code>## ## Call: ## lm(formula = age ~ circumference, data = Orange) ## ## Residuals: ## Min 1Q Median 3Q Max ## -317.88 -140.90 -17.20 96.54 471.16 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 16.6036 78.1406 0.212 0.833 ## circumference 7.8160 0.6059 12.900 1.93e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 203.1 on 33 degrees of freedom ## Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295 ## F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14</code></pre> <p>When we tidy these results, we get multiple rows of output for each model:</p> <div class="sourceCode" id="cb22"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb22-1"><a href="#cb22-1"></a><span class="kw">tidy</span>(lm_fit)</span></code></pre></div> <pre><code>## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 16.6 78.1 0.212 8.33e- 1 ## 2 circumference 7.82 0.606 12.9 1.93e-14</code></pre> <p>Now we can handle multiple regressions at once using exactly the same workflow as before:</p> <div class="sourceCode" id="cb24"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb24-1"><a href="#cb24-1"></a>Orange <span class="op">%>%</span></span> <span id="cb24-2"><a href="#cb24-2"></a><span class="st"> </span><span class="kw">nest</span>(<span class="dt">data =</span> <span class="op">-</span>Tree) <span class="op">%>%</span><span class="st"> </span></span> <span 
id="cb24-3"><a href="#cb24-3"></a><span class="st"> </span><span class="kw">mutate</span>(</span> <span id="cb24-4"><a href="#cb24-4"></a> <span class="dt">fit =</span> <span class="kw">map</span>(data, <span class="op">~</span><span class="st"> </span><span class="kw">lm</span>(age <span class="op">~</span><span class="st"> </span>circumference, <span class="dt">data =</span> .x)),</span> <span id="cb24-5"><a href="#cb24-5"></a> <span class="dt">tidied =</span> <span class="kw">map</span>(fit, tidy)</span> <span id="cb24-6"><a href="#cb24-6"></a> ) <span class="op">%>%</span><span class="st"> </span></span> <span id="cb24-7"><a href="#cb24-7"></a><span class="st"> </span><span class="kw">unnest</span>(tidied)</span></code></pre></div> <pre><code>## # A tibble: 10 x 8 ## Tree data fit term estimate std.error statistic p.value ## <ord> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1 <tibble [7 × … <lm> (Intercept) -265. 98.6 -2.68 4.36e-2 ## 2 1 <tibble [7 × … <lm> circumfere… 11.9 0.919 13.0 4.85e-5 ## 3 2 <tibble [7 × … <lm> (Intercept) -132. 83.1 -1.59 1.72e-1 ## 4 2 <tibble [7 × … <lm> circumfere… 7.80 0.560 13.9 3.43e-5 ## 5 3 <tibble [7 × … <lm> (Intercept) -210. 85.3 -2.46 5.74e-2 ## 6 3 <tibble [7 × … <lm> circumfere… 12.0 0.835 14.4 2.90e-5 ## 7 4 <tibble [7 × … <lm> (Intercept) -76.5 88.3 -0.867 4.26e-1 ## 8 4 <tibble [7 × … <lm> circumfere… 7.17 0.572 12.5 5.73e-5 ## 9 5 <tibble [7 × … <lm> (Intercept) -54.5 76.9 -0.709 5.10e-1 ## 10 5 <tibble [7 × … <lm> circumfere… 8.79 0.621 14.1 3.18e-5</code></pre> <p>You can just as easily use multiple predictors in the regressions, as shown here on the <code>mtcars</code> dataset. 
We nest the data into automatic and manual cars (the <code>am</code> column), then perform the regression within each nested tibble.</p> <div class="sourceCode" id="cb26"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb26-1"><a href="#cb26-1"></a><span class="kw">data</span>(mtcars)</span> <span id="cb26-2"><a href="#cb26-2"></a>mtcars <-<span class="st"> </span><span class="kw">as_tibble</span>(mtcars) <span class="co"># to play nicely with list-cols</span></span> <span id="cb26-3"><a href="#cb26-3"></a>mtcars</span></code></pre></div> <pre><code>## # A tibble: 32 x 11 ## mpg cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 ## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 ## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 ## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 ## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 ## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 ## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 ## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 ## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 ## 10 19.2 6 168. 
123 3.92 3.44 18.3 1 0 4 4 ## # … with 22 more rows</code></pre> <div class="sourceCode" id="cb28"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb28-1"><a href="#cb28-1"></a>mtcars <span class="op">%>%</span></span> <span id="cb28-2"><a href="#cb28-2"></a><span class="st"> </span><span class="kw">nest</span>(<span class="dt">data =</span> <span class="op">-</span>am) <span class="op">%>%</span><span class="st"> </span></span> <span id="cb28-3"><a href="#cb28-3"></a><span class="st"> </span><span class="kw">mutate</span>(</span> <span id="cb28-4"><a href="#cb28-4"></a> <span class="dt">fit =</span> <span class="kw">map</span>(data, <span class="op">~</span><span class="st"> </span><span class="kw">lm</span>(wt <span class="op">~</span><span class="st"> </span>mpg <span class="op">+</span><span class="st"> </span>qsec <span class="op">+</span><span class="st"> </span>gear, <span class="dt">data =</span> .x)), <span class="co"># S3 list-col</span></span> <span id="cb28-5"><a href="#cb28-5"></a> <span class="dt">tidied =</span> <span class="kw">map</span>(fit, tidy)</span> <span id="cb28-6"><a href="#cb28-6"></a> ) <span class="op">%>%</span><span class="st"> </span></span> <span id="cb28-7"><a href="#cb28-7"></a><span class="st"> </span><span class="kw">unnest</span>(tidied)</span></code></pre></div> <pre><code>## # A tibble: 8 x 8 ## am data fit term estimate std.error statistic p.value ## <dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1 <tibble [13 × 10… <lm> (Intercep… 4.28 3.46 1.24 2.47e-1 ## 2 1 <tibble [13 × 10… <lm> mpg -0.101 0.0294 -3.43 7.50e-3 ## 3 1 <tibble [13 × 10… <lm> qsec 0.0398 0.151 0.264 7.98e-1 ## 4 1 <tibble [13 × 10… <lm> gear -0.0229 0.349 -0.0656 9.49e-1 ## 5 0 <tibble [19 × 10… <lm> (Intercep… 4.92 1.40 3.52 3.09e-3 ## 6 0 <tibble [19 × 10… <lm> mpg -0.192 0.0443 -4.33 5.91e-4 ## 7 0 <tibble [19 × 10… <lm> qsec 0.0919 0.0983 0.935 3.65e-1 ## 8 0 <tibble [19 × 10… <lm> gear 0.147 0.368 0.398 6.96e-1</code></pre> 
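<p>Once the coefficients sit in one flat tibble, comparing a single term across groups takes only a <code>filter</code>. The following is a minimal sketch rather than part of the original analysis: it assumes the same <code>wt ~ mpg + qsec + gear</code> model as above, asks <code>tidy</code> for confidence intervals through its <code>conf.int</code> argument, and keeps only the <code>mpg</code> row for each value of <code>am</code>. The name <code>mpg_effects</code> is illustrative.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(broom)
library(dplyr)
library(tidyr)
library(purrr)

# mpg_effects is an illustrative name, not part of the original vignette
mpg_effects <- tibble::as_tibble(mtcars) %>%
  nest(data = -am) %>%
  mutate(
    # fit one model per transmission type
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),
    # conf.int = TRUE adds conf.low and conf.high columns to each tidied fit
    tidied = map(fit, tidy, conf.int = TRUE)
  ) %>%
  unnest(tidied) %>%
  # keep the mpg coefficient only, one row per group
  filter(term == "mpg") %>%
  select(am, term, estimate, conf.low, conf.high, p.value)

mpg_effects</code></pre></div>
<p>Each group contributes one row, so the two <code>mpg</code> slopes and their intervals can be read side by side or passed straight to <code>ggplot</code>.</p>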
<p>What if you want not just the <code>tidy</code> output, but the <code>augment</code> and <code>glance</code> outputs as well, while still performing each regression only once? Since we’re using list-columns, we can just fit the model once and use multiple list-columns to store the tidied, glanced and augmented outputs.</p> <div class="sourceCode" id="cb30"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb30-1"><a href="#cb30-1"></a>regressions <-<span class="st"> </span>mtcars <span class="op">%>%</span></span> <span id="cb30-2"><a href="#cb30-2"></a><span class="st"> </span><span class="kw">nest</span>(<span class="dt">data =</span> <span class="op">-</span>am) <span class="op">%>%</span><span class="st"> </span></span> <span id="cb30-3"><a href="#cb30-3"></a><span class="st"> </span><span class="kw">mutate</span>(</span> <span id="cb30-4"><a href="#cb30-4"></a> <span class="dt">fit =</span> <span class="kw">map</span>(data, <span class="op">~</span><span class="st"> </span><span class="kw">lm</span>(wt <span class="op">~</span><span class="st"> </span>mpg <span class="op">+</span><span class="st"> </span>qsec <span class="op">+</span><span class="st"> </span>gear, <span class="dt">data =</span> .x)),</span> <span id="cb30-5"><a href="#cb30-5"></a> <span class="dt">tidied =</span> <span class="kw">map</span>(fit, tidy),</span> <span id="cb30-6"><a href="#cb30-6"></a> <span class="dt">glanced =</span> <span class="kw">map</span>(fit, glance),</span> <span id="cb30-7"><a href="#cb30-7"></a> <span class="dt">augmented =</span> <span class="kw">map</span>(fit, augment)</span> <span id="cb30-8"><a href="#cb30-8"></a> )</span> <span id="cb30-9"><a href="#cb30-9"></a></span> <span id="cb30-10"><a href="#cb30-10"></a>regressions <span class="op">%>%</span><span class="st"> </span></span> <span id="cb30-11"><a href="#cb30-11"></a><span class="st"> </span><span class="kw">unnest</span>(tidied)</span></code></pre></div> <pre><code>## # A tibble: 8 x 10 ## 
am data fit term estimate std.error statistic p.value glanced augmented ## <dbl> <lis> <lis> <chr> <dbl> <dbl> <dbl> <dbl> <list> <list> ## 1 1 <tib… <lm> (Int… 4.28 3.46 1.24 2.47e-1 <tibbl… <tibble … ## 2 1 <tib… <lm> mpg -0.101 0.0294 -3.43 7.50e-3 <tibbl… <tibble … ## 3 1 <tib… <lm> qsec 0.0398 0.151 0.264 7.98e-1 <tibbl… <tibble … ## 4 1 <tib… <lm> gear -0.0229 0.349 -0.0656 9.49e-1 <tibbl… <tibble … ## 5 0 <tib… <lm> (Int… 4.92 1.40 3.52 3.09e-3 <tibbl… <tibble … ## 6 0 <tib… <lm> mpg -0.192 0.0443 -4.33 5.91e-4 <tibbl… <tibble … ## 7 0 <tib… <lm> qsec 0.0919 0.0983 0.935 3.65e-1 <tibbl… <tibble … ## 8 0 <tib… <lm> gear 0.147 0.368 0.398 6.96e-1 <tibbl… <tibble …</code></pre> <div class="sourceCode" id="cb32"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb32-1"><a href="#cb32-1"></a>regressions <span class="op">%>%</span><span class="st"> </span></span> <span id="cb32-2"><a href="#cb32-2"></a><span class="st"> </span><span class="kw">unnest</span>(glanced)</span></code></pre></div> <pre><code>## # A tibble: 2 x 17 ## am data fit tidied r.squared adj.r.squared sigma statistic p.value df ## <dbl> <lis> <lis> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 <tib… <lm> <tibb… 0.833 0.778 0.291 15.0 7.59e-4 3 ## 2 0 <tib… <lm> <tibb… 0.625 0.550 0.522 8.32 1.70e-3 3 ## # … with 7 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, ## # df.residual <int>, nobs <int>, augmented <list></code></pre> <div class="sourceCode" id="cb34"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb34-1"><a href="#cb34-1"></a>regressions <span class="op">%>%</span><span class="st"> </span></span> <span id="cb34-2"><a href="#cb34-2"></a><span class="st"> </span><span class="kw">unnest</span>(augmented)</span></code></pre></div> <pre><code>## # A tibble: 32 x 15 ## am data fit tidied glanced wt mpg qsec gear .fitted .resid ## <dbl> <lis> <lis> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 <tib… <lm> <tibb… <tibbl… 2.62 21 16.5 
4 2.73 -0.107 ## 2 1 <tib… <lm> <tibb… <tibbl… 2.88 21 17.0 4 2.75 0.126 ## 3 1 <tib… <lm> <tibb… <tibbl… 2.32 22.8 18.6 4 2.63 -0.310 ## 4 1 <tib… <lm> <tibb… <tibbl… 2.2 32.4 19.5 4 1.70 0.505 ## 5 1 <tib… <lm> <tibb… <tibbl… 1.62 30.4 18.5 4 1.86 -0.244 ## 6 1 <tib… <lm> <tibb… <tibbl… 1.84 33.9 19.9 4 1.56 0.274 ## 7 1 <tib… <lm> <tibb… <tibbl… 1.94 27.3 18.9 4 2.19 -0.253 ## 8 1 <tib… <lm> <tibb… <tibbl… 2.14 26 16.7 5 2.21 -0.0683 ## 9 1 <tib… <lm> <tibb… <tibbl… 1.51 30.4 16.9 5 1.77 -0.259 ## 10 1 <tib… <lm> <tibb… <tibbl… 3.17 15.8 14.5 5 3.15 0.0193 ## # … with 22 more rows, and 4 more variables: .std.resid <dbl>, .hat <dbl>, ## # .sigma <dbl>, .cooksd <dbl></code></pre> <p>By combining the estimates and p-values across all groups into the same tidy data frame (instead of a list of output model objects), a new class of analyses and visualizations becomes straightforward. This includes</p> <ul> <li>Sorting by p-value or estimate to find the most significant terms across all tests</li> <li>P-value histograms</li> <li>Volcano plots comparing p-values to effect size estimates</li> </ul> <p>In each of these cases, we can easily filter, facet, or distinguish based on the <code>term</code> column. In short, this makes the tools of tidy data analysis available for the <em>results</em> of data analysis and models, not just the inputs.</p> </div> <!-- code folding --> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>