EVOLUTION-MANAGER
Edit File: sha1.html
<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta http-equiv="X-UA-Compatible" content="IE=EDGE" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <meta name="author" content="Thierry Onkelinx and Dirk Eddelbuettel" /> <meta name="date" content="2020-02-22" /> <title>Calculating SHA1 hashes with digest() and sha1()</title> <style type="text/css">code{white-space: pre;}</style> <style type="text/css" data-origin="pandoc"> code.sourceCode > span { display: inline-block; line-height: 1.25; } code.sourceCode > span { color: inherit; text-decoration: inherit; } code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode { white-space: pre; position: relative; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { code.sourceCode { white-space: pre-wrap; } code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media screen { code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } for (var j = 0; j < rules.length; j++) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") continue; var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' && rule.style.backgroundColor === '') continue; // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> <style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">Calculating SHA1 hashes with digest() and sha1()</h1> <h4 class="author">Thierry Onkelinx and Dirk Eddelbuettel</h4> <h4 class="date">2020-02-22</h4> <p>NB: This vignette is work-in-progress and not yet complete.</p> <div id="short-intro-on-hashes" class="section level2"> <h2>Short intro on hashes</h2> <p>TBD</p> </div> <div id="difference-between-digest-and-sha1" class="section level2"> <h2>Difference between <code>digest()</code> and <code>sha1()</code></h2> <p>R <a href="https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f">FAQ 7.31</a> illustrates potential problems with floating point arithmetic. Mathematically the equality <span class="math inline">\(x = \sqrt{x}^2\)</span> should hold. But the precision of floating points numbers is finite. Hence some rounding is done, leading to numbers which are no longer identical.</p> <p>An illustration:</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="co"># FAQ 7.31</span></span> <span id="cb1-2"><a href="#cb1-2"></a>a0 <-<span class="st"> </span><span class="dv">2</span></span> <span id="cb1-3"><a href="#cb1-3"></a>b <-<span class="st"> </span><span class="kw">sqrt</span>(a0)</span> <span id="cb1-4"><a href="#cb1-4"></a>a1 <-<span class="st"> </span>b <span class="op">^</span><span class="st"> </span><span class="dv">2</span></span> <span id="cb1-5"><a href="#cb1-5"></a><span class="kw">identical</span>(a0, a1)</span></code></pre></div> <pre><code>## [1] FALSE</code></pre> <div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1"></a>a0 <span class="op">-</span><span class="st"> </span>a1</span></code></pre></div> <pre><code>## [1] -4.440892e-16</code></pre> <div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1"></a>a <-<span class="st"> </span><span class="kw">c</span>(a0, a1)</span> <span id="cb5-2"><a href="#cb5-2"></a><span class="co"># hexadecimal representation</span></span> <span id="cb5-3"><a href="#cb5-3"></a><span class="kw">sprintf</span>(<span class="st">"%a"</span>, a)</span></code></pre></div> <pre><code>## [1] "0x1p+1" "0x1.0000000000001p+1"</code></pre> <p>Although the difference is small, any difference will result in different hash when using the <code>digest()</code> function. However, the <code>sha1()</code> function tackles this problem by using the hexadecimal representation of the numbers and truncates that representation to a certain number of digits prior to calculating the hash function.</p> <div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1"></a><span class="kw">library</span>(digest)</span> <span id="cb7-2"><a href="#cb7-2"></a><span class="co"># different hashes with digest</span></span> <span id="cb7-3"><a href="#cb7-3"></a><span class="kw">sapply</span>(a, digest, <span class="dt">algo =</span> <span class="st">"sha1"</span>)</span></code></pre></div> <pre><code>## [1] "315a5aa84aa6cfa4f3fb4b652a596770be0365e8" ## [2] "5e3999bf79c230f7430e97d7f30070a7eff5ec92"</code></pre> <div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1"></a><span class="co"># same hash with sha1 with default digits (14)</span></span> <span id="cb9-2"><a href="#cb9-2"></a><span class="kw">sapply</span>(a, sha1)</span></code></pre></div> <pre><code>## [1] "8a938d8f5fb9b8ccb6893aa1068babcc517f32d4" ## [2] "8a938d8f5fb9b8ccb6893aa1068babcc517f32d4"</code></pre> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1"></a><span class="co"># larger digits can lead to different hashes</span></span> <span id="cb11-2"><a href="#cb11-2"></a><span class="kw">sapply</span>(a, sha1, <span class="dt">digits =</span> <span class="dv">15</span>)</span></code></pre></div> <pre><code>## [1] "98eb1dc9fada00b945d3ef01c77049ee5a4b7b9c" ## [2] "5a173e2445df1134908037f69ac005fbd8afee74"</code></pre> <div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1"></a><span class="co"># decreasing the number of digits gives a stronger truncation</span></span> <span id="cb13-2"><a href="#cb13-2"></a><span class="co"># the hash will change when then truncation gives a different result</span></span> <span id="cb13-3"><a href="#cb13-3"></a><span class="co"># case where truncating gives same hexadecimal value</span></span> <span id="cb13-4"><a href="#cb13-4"></a><span class="kw">sapply</span>(a, sha1, <span class="dt">digits =</span> <span class="dv">13</span>)</span></code></pre></div> <pre><code>## [1] "43b3b465c975af322c85473190a9214b79b79bf6" ## [2] "43b3b465c975af322c85473190a9214b79b79bf6"</code></pre> <div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1"></a><span class="kw">sapply</span>(a, sha1, <span class="dt">digits =</span> <span class="dv">10</span>)</span></code></pre></div> <pre><code>## [1] "6b537a9693c750ed535ca90527152f06e358aa3a" ## [2] "6b537a9693c750ed535ca90527152f06e358aa3a"</code></pre> <div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb17-1"><a href="#cb17-1"></a><span class="co"># case where truncating gives different hexadecimal value</span></span> <span id="cb17-2"><a href="#cb17-2"></a><span class="kw">c</span>(<span class="kw">sha1</span>(pi), <span class="kw">sha1</span>(pi, <span class="dt">digits =</span> <span class="dv">13</span>), <span class="kw">sha1</span>(pi, <span class="dt">digits =</span> <span class="dv">10</span>))</span></code></pre></div> <pre><code>## [1] "169388cf1ce60dc4e9904a22bc934c5db33d975b" ## [2] "20dc38536b6689d5f2d053f30efb09c44af78378" ## [3] "3a727417bd1807af2f0148cf3de69abff32c23ec"</code></pre> <p>The result of floating point arithematic on 32-bit and 64-bit can be slightly different. E.g. <code>print(pi ^ 11, 22)</code> returns <code>294204.01797389047</code> on 32-bit and <code>294204.01797389053</code> on 64-bit. Note that only the last 2 digits are different.</p> <table> <colgroup> <col width="33%"></col> <col width="33%"></col> <col width="33%"></col> </colgroup> <thead> <tr class="header"> <th>command</th> <th>32-bit</th> <th>64-bit</th> </tr> </thead> <tbody> <tr class="odd"> <td><code>print(pi ^ 11, 22)</code></td> <td><code>294204.01797389047</code></td> <td><code>294204.01797389053</code></td> </tr> <tr class="even"> <td><code>sprintf("%a", pi ^ 11)</code></td> <td><code>"0x1.1f4f01267bf5fp+18"</code></td> <td><code>"0x1.1f4f01267bf6p+18"</code></td> </tr> <tr class="odd"> <td><code>digest(pi ^ 11, algo = "sha1")</code></td> <td><code>"c5efc7f167df1bb402b27cf9b405d7cebfba339a"</code></td> <td><code>"b61f6fea5e2a7952692cefe8bba86a00af3de713"</code></td> </tr> <tr class="even"> <td><code>sha1(pi ^ 11, digits = 14)</code></td> <td><code>"5c7740500b8f78ec2354ea6af58ea69634d9b7b1"</code></td> <td><code>"4f3e296b9922a7ddece2183b1478d0685609a359"</code></td> </tr> <tr class="odd"> <td><code>sha1(pi ^ 11, digits = 13)</code></td> <td><code>"372289f87396b0877ccb4790cf40bcb5e658cad7"</code></td> <td><code>"372289f87396b0877ccb4790cf40bcb5e658cad7"</code></td> </tr> <tr class="even"> <td><code>sha1(pi ^ 11, digits = 10)</code></td> <td><code>"c05965af43f9566bfb5622f335817f674abfc9e4"</code></td> <td><code>"c05965af43f9566bfb5622f335817f674abfc9e4"</code></td> </tr> </tbody> </table> </div> <div id="choosing-digest-or-sha1" class="section level2"> <h2>Choosing <code>digest()</code> or <code>sha1()</code></h2> <p>TBD</p> </div> <div id="creating-a-sha1-method-for-other-classes" class="section level2"> <h2>Creating a sha1 method for other classes</h2> <div id="how-to" class="section level3"> <h3>How to</h3> <ol style="list-style-type: decimal"> <li>Identify the relevant components for the hash.</li> <li>Determine the class of each relevant component and check if they are handled by <code>sha1()</code>. <ul> <li>Write a method for each component class not yet handled by <code>sha1</code>.</li> </ul></li> <li>Extract the relevant components.</li> <li>Combine the relevant components into a list. Not required in case of a single component.</li> <li>Apply <code>sha1()</code> on the (list of) relevant component(s).</li> <li>Turn this into a function with name sha1._classname_.</li> <li>sha1._classname_ needs exactly the same arguments as <code>sha1()</code></li> <li>Choose sensible defaults for the arguments <ul> <li><code>zapsmall = 7</code> is recommended.</li> <li><code>digits = 14</code> is recommended in case all numerics are data.</li> <li><code>digits = 4</code> is recommended in case some numerics stem from floating point arithmetic.</li> </ul></li> </ol> </div> <div id="summary.lm" class="section level3"> <h3>summary.lm</h3> <p>Let’s illustrate this using the summary of a simple linear regression. Suppose that we want a hash that takes into account the coefficients, their standard error and sigma.</p> <div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb19-1"><a href="#cb19-1"></a><span class="co"># taken from the help file of lm.influence</span></span> <span id="cb19-2"><a href="#cb19-2"></a>lm_SR <-<span class="st"> </span><span class="kw">lm</span>(sr <span class="op">~</span><span class="st"> </span>pop15 <span class="op">+</span><span class="st"> </span>pop75 <span class="op">+</span><span class="st"> </span>dpi <span class="op">+</span><span class="st"> </span>ddpi, <span class="dt">data =</span> LifeCycleSavings)</span> <span id="cb19-3"><a href="#cb19-3"></a>lm_sum <-<span class="st"> </span><span class="kw">summary</span>(lm_SR)</span> <span id="cb19-4"><a href="#cb19-4"></a><span class="kw">class</span>(lm_sum)</span></code></pre></div> <pre><code>## [1] "summary.lm"</code></pre> <div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb21-1"><a href="#cb21-1"></a><span class="co"># str() gives the structure of the lm object</span></span> <span id="cb21-2"><a href="#cb21-2"></a><span class="kw">str</span>(lm_sum)</span></code></pre></div> <pre><code>## List of 11 ## $ call : language lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings) ## $ terms :Classes 'terms', 'formula' language sr ~ pop15 + pop75 + dpi + ddpi ## .. ..- attr(*, "variables")= language list(sr, pop15, pop75, dpi, ddpi) ## .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ... ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:5] "sr" "pop15" "pop75" "dpi" ... ## .. .. .. ..$ : chr [1:4] "pop15" "pop75" "dpi" "ddpi" ## .. ..- attr(*, "term.labels")= chr [1:4] "pop15" "pop75" "dpi" "ddpi" ## .. ..- attr(*, "order")= int [1:4] 1 1 1 1 ## .. ..- attr(*, "intercept")= int 1 ## .. ..- attr(*, "response")= int 1 ## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> ## .. ..- attr(*, "predvars")= language list(sr, pop15, pop75, dpi, ddpi) ## .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ... ## .. .. ..- attr(*, "names")= chr [1:5] "sr" "pop15" "pop75" "dpi" ... ## $ residuals : Named num [1:50] 0.864 0.616 2.219 -0.698 3.553 ... ## ..- attr(*, "names")= chr [1:50] "Australia" "Austria" "Belgium" "Bolivia" ... ## $ coefficients : num [1:5, 1:4] 28.566087 -0.461193 -1.691498 -0.000337 0.409695 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ... ## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)" ## $ aliased : Named logi [1:5] FALSE FALSE FALSE FALSE FALSE ## ..- attr(*, "names")= chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ... ## $ sigma : num 3.8 ## $ df : int [1:3] 5 45 5 ## $ r.squared : num 0.338 ## $ adj.r.squared: num 0.28 ## $ fstatistic : Named num [1:3] 5.76 4 45 ## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf" ## $ cov.unscaled : num [1:5, 1:5] 3.74 -7.24e-02 -4.46e-01 -7.86e-05 -1.88e-02 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ... ## .. ..$ : chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ... ## - attr(*, "class")= chr "summary.lm"</code></pre> <div class="sourceCode" id="cb23"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb23-1"><a href="#cb23-1"></a><span class="co"># extract the coefficients and their standard error</span></span> <span id="cb23-2"><a href="#cb23-2"></a>coef_sum <-<span class="st"> </span><span class="kw">coef</span>(lm_sum)[, <span class="kw">c</span>(<span class="st">"Estimate"</span>, <span class="st">"Std. Error"</span>)]</span> <span id="cb23-3"><a href="#cb23-3"></a><span class="co"># extract sigma</span></span> <span id="cb23-4"><a href="#cb23-4"></a>sigma <-<span class="st"> </span>lm_sum<span class="op">$</span>sigma</span> <span id="cb23-5"><a href="#cb23-5"></a><span class="co"># check the class of each component</span></span> <span id="cb23-6"><a href="#cb23-6"></a><span class="kw">class</span>(coef_sum)</span></code></pre></div> <pre><code>## [1] "matrix"</code></pre> <div class="sourceCode" id="cb25"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb25-1"><a href="#cb25-1"></a><span class="kw">class</span>(sigma)</span></code></pre></div> <pre><code>## [1] "numeric"</code></pre> <div class="sourceCode" id="cb27"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb27-1"><a href="#cb27-1"></a><span class="co"># sha1() has methods for both matrix and numeric</span></span> <span id="cb27-2"><a href="#cb27-2"></a><span class="co"># because the values originate from floating point arithmetic it is better to use a low number of digits</span></span> <span id="cb27-3"><a href="#cb27-3"></a><span class="kw">sha1</span>(coef_sum, <span class="dt">digits =</span> <span class="dv">4</span>)</span></code></pre></div> <pre><code>## [1] "1ab538262c9aad03d17f33c644d6c8d0b27367e8"</code></pre> <div class="sourceCode" id="cb29"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb29-1"><a href="#cb29-1"></a><span class="kw">sha1</span>(sigma, <span class="dt">digits =</span> <span class="dv">4</span>)</span></code></pre></div> <pre><code>## [1] "cbc83d1791ef1eeadd824ea9a038891b5889056b"</code></pre> <div class="sourceCode" id="cb31"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb31-1"><a href="#cb31-1"></a><span class="co"># we want a single hash</span></span> <span id="cb31-2"><a href="#cb31-2"></a><span class="co"># combining the components in a list is a solution that works</span></span> <span id="cb31-3"><a href="#cb31-3"></a><span class="kw">sha1</span>(<span class="kw">list</span>(coef_sum, sigma), <span class="dt">digits =</span> <span class="dv">4</span>)</span></code></pre></div> <pre><code>## [1] "d2e6e07d97e2a97882fd3bbf6e4455140c0c6412"</code></pre> <div class="sourceCode" id="cb33"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb33-1"><a href="#cb33-1"></a><span class="co"># now turn everything into an S3 method</span></span> <span id="cb33-2"><a href="#cb33-2"></a><span class="co"># - a function with name "sha1.classname"</span></span> <span id="cb33-3"><a href="#cb33-3"></a><span class="co"># - must have the same arguments as sha1()</span></span> <span id="cb33-4"><a href="#cb33-4"></a>sha1.summary.lm <-<span class="st"> </span><span class="cf">function</span>(x, <span class="dt">digits =</span> <span class="dv">4</span>, <span class="dt">zapsmall =</span> <span class="dv">7</span>){</span> <span id="cb33-5"><a href="#cb33-5"></a> coef_sum <-<span class="st"> </span><span class="kw">coef</span>(x)[, <span class="kw">c</span>(<span class="st">"Estimate"</span>, <span class="st">"Std. Error"</span>)]</span> <span id="cb33-6"><a href="#cb33-6"></a> sigma <-<span class="st"> </span>x<span class="op">$</span>sigma</span> <span id="cb33-7"><a href="#cb33-7"></a> combined <-<span class="st"> </span><span class="kw">list</span>(coef_sum, sigma)</span> <span id="cb33-8"><a href="#cb33-8"></a> <span class="kw">sha1</span>(combined, <span class="dt">digits =</span> digits, <span class="dt">zapsmall =</span> zapsmall)</span> <span id="cb33-9"><a href="#cb33-9"></a>}</span> <span id="cb33-10"><a href="#cb33-10"></a><span class="kw">sha1</span>(lm_sum)</span></code></pre></div> <pre><code>## [1] "d2e6e07d97e2a97882fd3bbf6e4455140c0c6412"</code></pre> <div class="sourceCode" id="cb35"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb35-1"><a href="#cb35-1"></a><span class="co"># try an altered dataset</span></span> <span id="cb35-2"><a href="#cb35-2"></a>LCS2 <-<span class="st"> </span>LifeCycleSavings[<span class="kw">rownames</span>(LifeCycleSavings) <span class="op">!=</span><span class="st"> "Zambia"</span>, ]</span> <span id="cb35-3"><a href="#cb35-3"></a>lm_SR2 <-<span class="st"> </span><span class="kw">lm</span>(sr <span class="op">~</span><span class="st"> </span>pop15 <span class="op">+</span><span class="st"> </span>pop75 <span class="op">+</span><span class="st"> </span>dpi <span class="op">+</span><span class="st"> </span>ddpi, <span class="dt">data =</span> LCS2)</span> <span id="cb35-4"><a href="#cb35-4"></a><span class="kw">sha1</span>(<span class="kw">summary</span>(lm_SR2))</span></code></pre></div> <pre><code>## [1] "637d8121e49b30dbc0b0e8cd02ff5a4a8b9d89e1"</code></pre> </div> <div id="lm" class="section level3"> <h3>lm</h3> <p>Let’s illustrate this using the summary of a simple linear regression. Suppose that we want a hash that takes into account the coefficients, their standard error and sigma.</p> <div class="sourceCode" id="cb37"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb37-1"><a href="#cb37-1"></a><span class="kw">class</span>(lm_SR)</span></code></pre></div> <pre><code>## [1] "lm"</code></pre> <div class="sourceCode" id="cb39"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb39-1"><a href="#cb39-1"></a><span class="co"># str() gives the structure of the lm object</span></span> <span id="cb39-2"><a href="#cb39-2"></a><span class="kw">str</span>(lm_SR)</span></code></pre></div> <pre><code>## List of 12 ## $ coefficients : Named num [1:5] 28.566087 -0.461193 -1.691498 -0.000337 0.409695 ## ..- attr(*, "names")= chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ... ## $ residuals : Named num [1:50] 0.864 0.616 2.219 -0.698 3.553 ... ## ..- attr(*, "names")= chr [1:50] "Australia" "Austria" "Belgium" "Bolivia" ... ## $ effects : Named num [1:50] -68.38 -14.29 7.3 -3.52 -7.94 ... ## ..- attr(*, "names")= chr [1:50] "(Intercept)" "pop15" "pop75" "dpi" ... ## $ rank : int 5 ## $ fitted.values: Named num [1:50] 10.57 11.45 10.95 6.45 9.33 ... ## ..- attr(*, "names")= chr [1:50] "Australia" "Austria" "Belgium" "Bolivia" ... ## $ assign : int [1:5] 0 1 2 3 4 ## $ qr :List of 5 ## ..$ qr : num [1:50, 1:5] -7.071 0.141 0.141 0.141 0.141 ... ## .. ..- attr(*, "dimnames")=List of 2 ## .. .. ..$ : chr [1:50] "Australia" "Austria" "Belgium" "Bolivia" ... ## .. .. ..$ : chr [1:5] "(Intercept)" "pop15" "pop75" "dpi" ... ## .. ..- attr(*, "assign")= int [1:5] 0 1 2 3 4 ## ..$ qraux: num [1:5] 1.14 1.17 1.16 1.15 1.05 ## ..$ pivot: int [1:5] 1 2 3 4 5 ## ..$ tol : num 1e-07 ## ..$ rank : int 5 ## ..- attr(*, "class")= chr "qr" ## $ df.residual : int 45 ## $ xlevels : Named list() ## $ call : language lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings) ## $ terms :Classes 'terms', 'formula' language sr ~ pop15 + pop75 + dpi + ddpi ## .. ..- attr(*, "variables")= language list(sr, pop15, pop75, dpi, ddpi) ## .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ... ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:5] "sr" "pop15" "pop75" "dpi" ... ## .. .. .. ..$ : chr [1:4] "pop15" "pop75" "dpi" "ddpi" ## .. ..- attr(*, "term.labels")= chr [1:4] "pop15" "pop75" "dpi" "ddpi" ## .. ..- attr(*, "order")= int [1:4] 1 1 1 1 ## .. ..- attr(*, "intercept")= int 1 ## .. ..- attr(*, "response")= int 1 ## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> ## .. ..- attr(*, "predvars")= language list(sr, pop15, pop75, dpi, ddpi) ## .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ... ## .. .. ..- attr(*, "names")= chr [1:5] "sr" "pop15" "pop75" "dpi" ... ## $ model :'data.frame': 50 obs. of 5 variables: ## ..$ sr : num [1:50] 11.43 12.07 13.17 5.75 12.88 ... ## ..$ pop15: num [1:50] 29.4 23.3 23.8 41.9 42.2 ... ## ..$ pop75: num [1:50] 2.87 4.41 4.43 1.67 0.83 2.85 1.34 0.67 1.06 1.14 ... ## ..$ dpi : num [1:50] 2330 1508 2108 189 728 ... ## ..$ ddpi : num [1:50] 2.87 3.93 3.82 0.22 4.56 2.43 2.67 6.51 3.08 2.8 ... ## ..- attr(*, "terms")=Classes 'terms', 'formula' language sr ~ pop15 + pop75 + dpi + ddpi ## .. .. ..- attr(*, "variables")= language list(sr, pop15, pop75, dpi, ddpi) ## .. .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ... ## .. .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. .. ..$ : chr [1:5] "sr" "pop15" "pop75" "dpi" ... ## .. .. .. .. ..$ : chr [1:4] "pop15" "pop75" "dpi" "ddpi" ## .. .. ..- attr(*, "term.labels")= chr [1:4] "pop15" "pop75" "dpi" "ddpi" ## .. .. ..- attr(*, "order")= int [1:4] 1 1 1 1 ## .. .. ..- attr(*, "intercept")= int 1 ## .. .. ..- attr(*, "response")= int 1 ## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> ## .. .. ..- attr(*, "predvars")= language list(sr, pop15, pop75, dpi, ddpi) ## .. .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ... ## .. .. .. ..- attr(*, "names")= chr [1:5] "sr" "pop15" "pop75" "dpi" ... ## - attr(*, "class")= chr "lm"</code></pre> <div class="sourceCode" id="cb41"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb41-1"><a href="#cb41-1"></a><span class="co"># extract the model and the terms</span></span> <span id="cb41-2"><a href="#cb41-2"></a>lm_model <-<span class="st"> </span>lm_SR<span class="op">$</span>model</span> <span id="cb41-3"><a href="#cb41-3"></a>lm_terms <-<span class="st"> </span>lm_SR<span class="op">$</span>terms</span> <span id="cb41-4"><a href="#cb41-4"></a><span class="co"># check their class</span></span> <span id="cb41-5"><a href="#cb41-5"></a><span class="kw">class</span>(lm_model) <span class="co"># handled by sha1()</span></span></code></pre></div> <pre><code>## [1] "data.frame"</code></pre> <div class="sourceCode" id="cb43"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb43-1"><a href="#cb43-1"></a><span class="kw">class</span>(lm_terms) <span class="co"># not handled by sha1()</span></span></code></pre></div> <pre><code>## [1] "terms" "formula"</code></pre> <div class="sourceCode" id="cb45"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb45-1"><a href="#cb45-1"></a><span class="co"># define a method for formula</span></span> <span id="cb45-2"><a href="#cb45-2"></a>sha1.formula <-<span class="st"> </span><span class="cf">function</span>(x, <span class="dt">digits =</span> <span class="dv">14</span>, <span class="dt">zapsmall =</span> <span class="dv">7</span>, ..., <span class="dt">algo =</span> <span class="st">"sha1"</span>){</span> <span id="cb45-3"><a href="#cb45-3"></a> <span class="kw">sha1</span>(<span class="kw">as.character</span>(x), <span class="dt">digits =</span> digits, <span class="dt">zapsmall =</span> zapsmall, <span class="dt">algo =</span> algo)</span> <span id="cb45-4"><a href="#cb45-4"></a>}</span> <span id="cb45-5"><a href="#cb45-5"></a><span class="kw">sha1</span>(lm_terms)</span></code></pre></div> <pre><code>## [1] "2737d209720aa7d1c0555050ad06ebe89f3850cd"</code></pre> <div class="sourceCode" id="cb47"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb47-1"><a href="#cb47-1"></a><span class="kw">sha1</span>(lm_model)</span></code></pre></div> <pre><code>## [1] "27b7dd9e3e09b9577da6947b8473b63a1d0b6eb4"</code></pre> <div class="sourceCode" id="cb49"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb49-1"><a href="#cb49-1"></a><span class="co"># define a method for lm</span></span> <span id="cb49-2"><a href="#cb49-2"></a>sha1.lm <-<span class="st"> </span><span class="cf">function</span>(x, <span class="dt">digits =</span> <span class="dv">14</span>, <span class="dt">zapsmall =</span> <span class="dv">7</span>, ..., <span class="dt">algo =</span> <span class="st">"sha1"</span>){</span> <span id="cb49-3"><a href="#cb49-3"></a> lm_model <-<span class="st"> </span>x<span class="op">$</span>model</span> <span id="cb49-4"><a href="#cb49-4"></a> lm_terms <-<span class="st"> </span>x<span class="op">$</span>terms</span> <span id="cb49-5"><a href="#cb49-5"></a> combined <-<span class="st"> </span><span class="kw">list</span>(lm_model, lm_terms)</span> <span id="cb49-6"><a href="#cb49-6"></a> <span class="kw">sha1</span>(combined, <span class="dt">digits =</span> digits, <span class="dt">zapsmall =</span> zapsmall, ..., <span class="dt">algo =</span> algo)</span> <span id="cb49-7"><a href="#cb49-7"></a>}</span> <span id="cb49-8"><a href="#cb49-8"></a><span class="kw">sha1</span>(lm_SR)</span></code></pre></div> <pre><code>## [1] "7eda2a9d58e458c8e782e40ce140d62b836b2a2f"</code></pre> <div class="sourceCode" id="cb51"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb51-1"><a href="#cb51-1"></a><span class="kw">sha1</span>(lm_SR2)</span></code></pre></div> <pre><code>## [1] "4d3abdb1f17bd12fdf9d9b91a2ad04c07824fe4a"</code></pre> </div> </div> <div id="using-hashes-to-track-changes-in-analysis" class="section level2"> <h2>Using hashes to track changes in analysis</h2> <p>Use case</p> <ul> <li><p>automated analysis</p></li> <li><p>update frequency of the data might be lower than the frequency of automated analysis</p></li> <li><p>similar analyses on many datasets (e.g. many species in ecology)</p></li> <li><p>analyses that require a lot of computing time</p> <ul> <li>not rerunning an analysis because nothing has changed saves enough resources to compensate the overhead of tracking changes</li> </ul></li> <li><p>Bundle all relevant information on an analysis in a class</p> <ul> <li>data</li> <li>method</li> <li>formula</li> <li>other metadata</li> <li>resulting model</li> </ul></li> <li><p>calculate <code>sha1()</code></p> <p>file fingerprint ~ <code>sha1()</code> on the stable parts</p> <p>status fingerprint ~ <code>sha1()</code> on the parts that result for the model</p></li> </ul> <ol style="list-style-type: decimal"> <li>Prepare analysis objects</li> <li>Store each analysis object in a rda file which uses the file fingerprint as filename <ul> <li>File will already exist when no change in analysis</li> <li>Don’t overwrite existing files</li> </ul></li> <li>Loop over all rda files <ul> <li>Do nothing if the analysis was run</li> <li>Otherwise run the analysis and update the status and status fingerprint</li> </ul></li> </ol> </div> <!-- code folding --> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>