<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta charset="utf-8" /> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="generator" content="pandoc" /> <meta name="viewport" content="width=device-width, initial-scale=1"> <title>Introduction to stringr</title> <style type="text/css">code{white-space: pre;}</style> <style type="text/css" data-origin="pandoc"> a.sourceLine { display: inline-block; line-height: 1.25; } a.sourceLine { pointer-events: none; color: inherit; text-decoration: inherit; } a.sourceLine:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode { white-space: pre; position: relative; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { code.sourceCode { white-space: pre-wrap; } a.sourceLine { text-indent: -1em; padding-left: 1em; } } pre.numberSource a.sourceLine { position: relative; left: -4em; } pre.numberSource a.sourceLine::before { content: attr(data-line-number); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; pointer-events: all; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media screen { a.sourceLine::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* 
Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ </style> <script> // apply pandoc div.sourceCode style to pre.sourceCode instead (function() { var sheets = document.styleSheets; for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } for (var j = 0; j < rules.length; j++) { var rule = rules[j]; // check if there is a div.sourceCode rule if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") continue; var style = rule.style.cssText; // check if color or background-color is set if (rule.style.color === '' || rule.style.backgroundColor === '') continue; // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); } } })(); </script> 
<style type="text/css">body { background-color: #fff; margin: 1em auto; max-width: 700px; overflow: visible; padding-left: 2em; padding-right: 2em; font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.35; } #header { text-align: center; } #TOC { clear: both; margin: 0 0 10px 10px; padding: 4px; width: 400px; border: 1px solid #CCCCCC; border-radius: 5px; background-color: #f6f6f6; font-size: 13px; line-height: 1.3; } #TOC .toctitle { font-weight: bold; font-size: 15px; margin-left: 5px; } #TOC ul { padding-left: 40px; margin-left: -1.5em; margin-top: 5px; margin-bottom: 5px; } #TOC ul ul { margin-left: -2em; } #TOC li { line-height: 16px; } table { margin: 1em auto; border-width: 1px; border-color: #DDDDDD; border-style: outset; border-collapse: collapse; } table th { border-width: 2px; padding: 5px; border-style: inset; } table td { border-width: 1px; border-style: inset; line-height: 18px; padding: 5px 5px; } table, table th, table td { border-left-style: none; border-right-style: none; } table thead, table tr.even { background-color: #f7f7f7; } p { margin: 0.5em 0; } blockquote { background-color: #f6f6f6; padding: 0.25em 0.75em; } hr { border-style: solid; border: none; border-top: 1px solid #777; margin: 28px 0; } dl { margin-left: 0; } dl dd { margin-bottom: 13px; margin-left: 13px; } dl dt { font-weight: bold; } ul { margin-top: 0; } ul li { list-style: circle outside; } ul ul { margin-bottom: 0; } pre, code { background-color: #f7f7f7; border-radius: 3px; color: #333; white-space: pre-wrap; } pre { border-radius: 3px; margin: 5px 0px 10px 0px; padding: 10px; } pre:not([class]) { background-color: #f7f7f7; } code { font-family: Consolas, Monaco, 'Courier New', monospace; font-size: 85%; } p > code, li > code { padding: 2px 0px; } div.figure { text-align: center; } img { background-color: #FFFFFF; padding: 2px; border: 1px solid #DDDDDD; border-radius: 3px; border: 1px solid #CCCCCC; margin: 0 5px; } h1 { 
margin-top: 0; font-size: 35px; line-height: 40px; } h2 { border-bottom: 4px solid #f7f7f7; padding-top: 10px; padding-bottom: 2px; font-size: 145%; } h3 { border-bottom: 2px solid #f7f7f7; padding-top: 10px; font-size: 120%; } h4 { border-bottom: 1px solid #f7f7f7; margin-left: 8px; font-size: 105%; } h5, h6 { border-bottom: 1px solid #ccc; font-size: 105%; } a { color: #0033dd; text-decoration: none; } a:hover { color: #6666ff; } a:visited { color: #800080; } a:visited:hover { color: #BB00BB; } a[href^="http:"] { text-decoration: underline; } a[href^="https:"] { text-decoration: underline; } code > span.kw { color: #555; font-weight: bold; } code > span.dt { color: #902000; } code > span.dv { color: #40a070; } code > span.bn { color: #d14; } code > span.fl { color: #d14; } code > span.ch { color: #d14; } code > span.st { color: #d14; } code > span.co { color: #888888; font-style: italic; } code > span.ot { color: #007020; } code > span.al { color: #ff0000; font-weight: bold; } code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; } </style> </head> <body> <h1 class="title toc-ignore">Introduction to stringr</h1> <p>There are four main families of functions in stringr:</p> <ol style="list-style-type: decimal"> <li><p>Character manipulation: these functions allow you to manipulate individual characters within the strings in character vectors.</p></li> <li><p>Whitespace tools to add, remove, and manipulate whitespace.</p></li> <li><p>Locale-sensitive operations, whose results vary from locale to locale.</p></li> <li><p>Pattern matching functions. These recognise four engines of pattern description. 
The most common is regular expressions, but there are three other tools.</p></li> </ol> <div id="getting-and-setting-individual-characters" class="section level2"> <h2>Getting and setting individual characters</h2> <p>You can get the length of a string with <code>str_length()</code>:</p> <div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" data-line-number="1"><span class="kw">str_length</span>(<span class="st">"abc"</span>)</a> <a class="sourceLine" id="cb1-2" data-line-number="2"><span class="co">#> [1] 3</span></a></code></pre></div> <p>This is now equivalent to the base R function <code>nchar()</code>. Previously it was needed to work around issues with <code>nchar()</code> such as the fact that it returned 2 for <code>nchar(NA)</code>. This has been fixed as of R 3.3.0, so it is no longer so important.</p> <p>You can access individual characters using <code>str_sub()</code>. It takes three arguments: a character vector, a <code>start</code> position and an <code>end</code> position. Each position can be either a positive integer, which counts from the left, or a negative integer, which counts from the right. 
The positions are inclusive, and if longer than the string, will be silently truncated.</p> <div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" data-line-number="1">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"abcdef"</span>, <span class="st">"ghifjk"</span>)</a> <a class="sourceLine" id="cb2-2" data-line-number="2"></a> <a class="sourceLine" id="cb2-3" data-line-number="3"><span class="co"># The 3rd letter</span></a> <a class="sourceLine" id="cb2-4" data-line-number="4"><span class="kw">str_sub</span>(x, <span class="dv">3</span>, <span class="dv">3</span>)</a> <a class="sourceLine" id="cb2-5" data-line-number="5"><span class="co">#> [1] "c" "i"</span></a> <a class="sourceLine" id="cb2-6" data-line-number="6"></a> <a class="sourceLine" id="cb2-7" data-line-number="7"><span class="co"># The 2nd to 2nd-to-last character</span></a> <a class="sourceLine" id="cb2-8" data-line-number="8"><span class="kw">str_sub</span>(x, <span class="dv">2</span>, <span class="dv">-2</span>)</a> <a class="sourceLine" id="cb2-9" data-line-number="9"><span class="co">#> [1] "bcde" "hifj"</span></a></code></pre></div> <p>You can also use <code>str_sub()</code> to modify strings:</p> <div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" data-line-number="1"><span class="kw">str_sub</span>(x, <span class="dv">3</span>, <span class="dv">3</span>) <-<span class="st"> "X"</span></a> <a class="sourceLine" id="cb3-2" data-line-number="2">x</a> <a class="sourceLine" id="cb3-3" data-line-number="3"><span class="co">#> [1] "abXdef" "ghXfjk"</span></a></code></pre></div> <p>To duplicate individual strings, you can use <code>str_dup()</code>:</p> <div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" data-line-number="1"><span class="kw">str_dup</span>(x, <span 
class="kw">c</span>(<span class="dv">2</span>, <span class="dv">3</span>))</a> <a class="sourceLine" id="cb4-2" data-line-number="2"><span class="co">#> [1] "abXdefabXdef" "ghXfjkghXfjkghXfjk"</span></a></code></pre></div> </div> <div id="whitespace" class="section level2"> <h2>Whitespace</h2> <p>Three functions add, remove, or modify whitespace:</p> <ol style="list-style-type: decimal"> <li><p><code>str_pad()</code> pads a string to a fixed length by adding extra whitespace on the left, right, or both sides.</p> <div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" data-line-number="1">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"abc"</span>, <span class="st">"defghi"</span>)</a> <a class="sourceLine" id="cb5-2" data-line-number="2"><span class="kw">str_pad</span>(x, <span class="dv">10</span>) <span class="co"># default pads on left</span></a> <a class="sourceLine" id="cb5-3" data-line-number="3"><span class="co">#> [1] "       abc" "    defghi"</span></a> <a class="sourceLine" id="cb5-4" data-line-number="4"><span class="kw">str_pad</span>(x, <span class="dv">10</span>, <span class="st">"both"</span>)</a> <a class="sourceLine" id="cb5-5" data-line-number="5"><span class="co">#> [1] "   abc    " "  defghi  "</span></a></code></pre></div> <p>(You can pad with other characters by using the <code>pad</code> argument.)</p> <p><code>str_pad()</code> will never make a string shorter:</p> <div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb6-1" data-line-number="1"><span class="kw">str_pad</span>(x, <span class="dv">4</span>)</a> <a class="sourceLine" id="cb6-2" data-line-number="2"><span class="co">#> [1] " abc" "defghi"</span></a></code></pre></div> <p>So if you want to ensure that all strings are the same length (often useful for print methods), combine <code>str_pad()</code> and <code>str_trunc()</code>:</p> <div 
class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb7-1" data-line-number="1">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"Short"</span>, <span class="st">"This is a long string"</span>)</a> <a class="sourceLine" id="cb7-2" data-line-number="2"></a> <a class="sourceLine" id="cb7-3" data-line-number="3">x <span class="op">%>%</span><span class="st"> </span></a> <a class="sourceLine" id="cb7-4" data-line-number="4"><span class="st"> </span><span class="kw">str_trunc</span>(<span class="dv">10</span>) <span class="op">%>%</span><span class="st"> </span></a> <a class="sourceLine" id="cb7-5" data-line-number="5"><span class="st"> </span><span class="kw">str_pad</span>(<span class="dv">10</span>, <span class="st">"right"</span>)</a> <a class="sourceLine" id="cb7-6" data-line-number="6"><span class="co">#> [1] "Short     " "This is..."</span></a></code></pre></div></li> <li><p>The opposite of <code>str_pad()</code> is <code>str_trim()</code>, which removes leading and trailing whitespace:</p> <div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb8-1" data-line-number="1">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">" a "</span>, <span class="st">"b "</span>, <span class="st">" c"</span>)</a> <a class="sourceLine" id="cb8-2" data-line-number="2"><span class="kw">str_trim</span>(x)</a> <a class="sourceLine" id="cb8-3" data-line-number="3"><span class="co">#> [1] "a" "b" "c"</span></a> <a class="sourceLine" id="cb8-4" data-line-number="4"><span class="kw">str_trim</span>(x, <span class="st">"left"</span>)</a> <a class="sourceLine" id="cb8-5" data-line-number="5"><span class="co">#> [1] "a " "b " "c"</span></a></code></pre></div></li> <li><p>You can use <code>str_wrap()</code> to modify existing whitespace in order to wrap a paragraph of text, such that the length of each line is as similar as possible.</p> 
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" data-line-number="1">jabberwocky <-<span class="st"> </span><span class="kw">str_c</span>(</a> <a class="sourceLine" id="cb9-2" data-line-number="2"> <span class="st">"`Twas brillig, and the slithy toves "</span>,</a> <a class="sourceLine" id="cb9-3" data-line-number="3"> <span class="st">"did gyre and gimble in the wabe: "</span>,</a> <a class="sourceLine" id="cb9-4" data-line-number="4"> <span class="st">"All mimsy were the borogoves, "</span>,</a> <a class="sourceLine" id="cb9-5" data-line-number="5"> <span class="st">"and the mome raths outgrabe. "</span></a> <a class="sourceLine" id="cb9-6" data-line-number="6">)</a> <a class="sourceLine" id="cb9-7" data-line-number="7"><span class="kw">cat</span>(<span class="kw">str_wrap</span>(jabberwocky, <span class="dt">width =</span> <span class="dv">40</span>))</a> <a class="sourceLine" id="cb9-8" data-line-number="8"><span class="co">#> `Twas brillig, and the slithy toves did</span></a> <a class="sourceLine" id="cb9-9" data-line-number="9"><span class="co">#> gyre and gimble in the wabe: All mimsy</span></a> <a class="sourceLine" id="cb9-10" data-line-number="10"><span class="co">#> were the borogoves, and the mome raths</span></a> <a class="sourceLine" id="cb9-11" data-line-number="11"><span class="co">#> outgrabe.</span></a></code></pre></div></li> </ol> </div> <div id="locale-sensitive" class="section level2"> <h2>Locale sensitive</h2> <p>A handful of stringr functions are locale-sensitive: they will perform differently in different regions of the world. 
These functions are case transformation functions:</p> <div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb10-1" data-line-number="1">x <-<span class="st"> "I like horses."</span></a> <a class="sourceLine" id="cb10-2" data-line-number="2"><span class="kw">str_to_upper</span>(x)</a> <a class="sourceLine" id="cb10-3" data-line-number="3"><span class="co">#> [1] "I LIKE HORSES."</span></a> <a class="sourceLine" id="cb10-4" data-line-number="4"><span class="kw">str_to_title</span>(x)</a> <a class="sourceLine" id="cb10-5" data-line-number="5"><span class="co">#> [1] "I Like Horses."</span></a> <a class="sourceLine" id="cb10-6" data-line-number="6"></a> <a class="sourceLine" id="cb10-7" data-line-number="7"><span class="kw">str_to_lower</span>(x)</a> <a class="sourceLine" id="cb10-8" data-line-number="8"><span class="co">#> [1] "i like horses."</span></a> <a class="sourceLine" id="cb10-9" data-line-number="9"><span class="co"># Turkish has two sorts of i: with and without the dot</span></a> <a class="sourceLine" id="cb10-10" data-line-number="10"><span class="kw">str_to_lower</span>(x, <span class="st">"tr"</span>)</a> <a class="sourceLine" id="cb10-11" data-line-number="11"><span class="co">#> [1] "ı like horses."</span></a></code></pre></div> <p>String ordering and sorting:</p> <div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" data-line-number="1">x <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"y"</span>, <span class="st">"i"</span>, <span class="st">"k"</span>)</a> <a class="sourceLine" id="cb11-2" data-line-number="2"><span class="kw">str_order</span>(x)</a> <a class="sourceLine" id="cb11-3" data-line-number="3"><span class="co">#> [1] 2 3 1</span></a> <a class="sourceLine" id="cb11-4" data-line-number="4"></a> <a class="sourceLine" id="cb11-5" data-line-number="5"><span class="kw">str_sort</span>(x)</a> <a 
class="sourceLine" id="cb11-6" data-line-number="6"><span class="co">#> [1] "i" "k" "y"</span></a> <a class="sourceLine" id="cb11-7" data-line-number="7"><span class="co"># In Lithuanian, y comes between i and k</span></a> <a class="sourceLine" id="cb11-8" data-line-number="8"><span class="kw">str_sort</span>(x, <span class="dt">locale =</span> <span class="st">"lt"</span>)</a> <a class="sourceLine" id="cb11-9" data-line-number="9"><span class="co">#> [1] "i" "y" "k"</span></a></code></pre></div> <p>The locale always defaults to English to ensure that the default behaviour is identical across systems. Locales always include a two-letter ISO-639-1 language code (like “en” for English or “zh” for Chinese), and optionally an ISO-3166 country code (like “en_GB” vs “en_US”). You can see a complete list of available locales by running <code>stringi::stri_locale_list()</code>.</p> </div> <div id="pattern-matching" class="section level2"> <h2>Pattern matching</h2> <p>The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.</p> <div id="tasks" class="section level3"> <h3>Tasks</h3> <p>Each pattern matching function has the same first two arguments: a character vector of <code>string</code>s to process and a single <code>pattern</code> to match. stringr provides pattern matching functions to <strong>detect</strong>, <strong>locate</strong>, <strong>extract</strong>, <strong>match</strong>, <strong>replace</strong>, and <strong>split</strong> strings. 
I’ll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:</p> <div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb12-1" data-line-number="1">strings <-<span class="st"> </span><span class="kw">c</span>(</a> <a class="sourceLine" id="cb12-2" data-line-number="2"> <span class="st">"apple"</span>, </a> <a class="sourceLine" id="cb12-3" data-line-number="3"> <span class="st">"219 733 8965"</span>, </a> <a class="sourceLine" id="cb12-4" data-line-number="4"> <span class="st">"329-293-8753"</span>, </a> <a class="sourceLine" id="cb12-5" data-line-number="5"> <span class="st">"Work: 579-499-7527; Home: 543.355.3679"</span></a> <a class="sourceLine" id="cb12-6" data-line-number="6">)</a> <a class="sourceLine" id="cb12-7" data-line-number="7">phone <-<span class="st"> "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"</span></a></code></pre></div> <ul> <li><p><code>str_detect()</code> detects the presence or absence of a pattern and returns a logical vector (similar to <code>grepl()</code>). 
<code>str_subset()</code> returns the elements of a character vector that match a regular expression (similar to <code>grep()</code> with <code>value = TRUE</code>).</p> <div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" data-line-number="1"><span class="co"># Which strings contain phone numbers?</span></a> <a class="sourceLine" id="cb13-2" data-line-number="2"><span class="kw">str_detect</span>(strings, phone)</a> <a class="sourceLine" id="cb13-3" data-line-number="3"><span class="co">#> [1] FALSE TRUE TRUE TRUE</span></a> <a class="sourceLine" id="cb13-4" data-line-number="4"><span class="kw">str_subset</span>(strings, phone)</a> <a class="sourceLine" id="cb13-5" data-line-number="5"><span class="co">#> [1] "219 733 8965" </span></a> <a class="sourceLine" id="cb13-6" data-line-number="6"><span class="co">#> [2] "329-293-8753" </span></a> <a class="sourceLine" id="cb13-7" data-line-number="7"><span class="co">#> [3] "Work: 579-499-7527; Home: 543.355.3679"</span></a></code></pre></div></li> <li><p><code>str_count()</code> counts the number of matches:</p> <div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb14-1" data-line-number="1"><span class="co"># How many phone numbers in each string?</span></a> <a class="sourceLine" id="cb14-2" data-line-number="2"><span class="kw">str_count</span>(strings, phone)</a> <a class="sourceLine" id="cb14-3" data-line-number="3"><span class="co">#> [1] 0 1 1 2</span></a></code></pre></div></li> <li><p><code>str_locate()</code> locates the <strong>first</strong> position of a pattern and returns a numeric matrix with columns start and end. <code>str_locate_all()</code> locates all matches, returning a list of numeric matrices. 
Similar to <code>regexpr()</code> and <code>gregexpr()</code>.</p> <div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb15-1" data-line-number="1"><span class="co"># Where in the string is the phone number located?</span></a> <a class="sourceLine" id="cb15-2" data-line-number="2">(loc <-<span class="st"> </span><span class="kw">str_locate</span>(strings, phone))</a> <a class="sourceLine" id="cb15-3" data-line-number="3"><span class="co">#> start end</span></a> <a class="sourceLine" id="cb15-4" data-line-number="4"><span class="co">#> [1,] NA NA</span></a> <a class="sourceLine" id="cb15-5" data-line-number="5"><span class="co">#> [2,] 1 12</span></a> <a class="sourceLine" id="cb15-6" data-line-number="6"><span class="co">#> [3,] 1 12</span></a> <a class="sourceLine" id="cb15-7" data-line-number="7"><span class="co">#> [4,] 7 18</span></a> <a class="sourceLine" id="cb15-8" data-line-number="8"><span class="kw">str_locate_all</span>(strings, phone)</a> <a class="sourceLine" id="cb15-9" data-line-number="9"><span class="co">#> [[1]]</span></a> <a class="sourceLine" id="cb15-10" data-line-number="10"><span class="co">#> start end</span></a> <a class="sourceLine" id="cb15-11" data-line-number="11"><span class="co">#> </span></a> <a class="sourceLine" id="cb15-12" data-line-number="12"><span class="co">#> [[2]]</span></a> <a class="sourceLine" id="cb15-13" data-line-number="13"><span class="co">#> start end</span></a> <a class="sourceLine" id="cb15-14" data-line-number="14"><span class="co">#> [1,] 1 12</span></a> <a class="sourceLine" id="cb15-15" data-line-number="15"><span class="co">#> </span></a> <a class="sourceLine" id="cb15-16" data-line-number="16"><span class="co">#> [[3]]</span></a> <a class="sourceLine" id="cb15-17" data-line-number="17"><span class="co">#> start end</span></a> <a class="sourceLine" id="cb15-18" data-line-number="18"><span class="co">#> [1,] 1 12</span></a> <a class="sourceLine" 
id="cb15-19" data-line-number="19"><span class="co">#> </span></a> <a class="sourceLine" id="cb15-20" data-line-number="20"><span class="co">#> [[4]]</span></a> <a class="sourceLine" id="cb15-21" data-line-number="21"><span class="co">#> start end</span></a> <a class="sourceLine" id="cb15-22" data-line-number="22"><span class="co">#> [1,] 7 18</span></a> <a class="sourceLine" id="cb15-23" data-line-number="23"><span class="co">#> [2,] 27 38</span></a></code></pre></div></li> <li><p><code>str_extract()</code> extracts text corresponding to the <strong>first</strong> match, returning a character vector. <code>str_extract_all()</code> extracts all matches and returns a list of character vectors.</p> <div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb16-1" data-line-number="1"><span class="co"># What are the phone numbers?</span></a> <a class="sourceLine" id="cb16-2" data-line-number="2"><span class="kw">str_extract</span>(strings, phone)</a> <a class="sourceLine" id="cb16-3" data-line-number="3"><span class="co">#> [1] NA "219 733 8965" "329-293-8753" "579-499-7527"</span></a> <a class="sourceLine" id="cb16-4" data-line-number="4"><span class="kw">str_extract_all</span>(strings, phone)</a> <a class="sourceLine" id="cb16-5" data-line-number="5"><span class="co">#> [[1]]</span></a> <a class="sourceLine" id="cb16-6" data-line-number="6"><span class="co">#> character(0)</span></a> <a class="sourceLine" id="cb16-7" data-line-number="7"><span class="co">#> </span></a> <a class="sourceLine" id="cb16-8" data-line-number="8"><span class="co">#> [[2]]</span></a> <a class="sourceLine" id="cb16-9" data-line-number="9"><span class="co">#> [1] "219 733 8965"</span></a> <a class="sourceLine" id="cb16-10" data-line-number="10"><span class="co">#> </span></a> <a class="sourceLine" id="cb16-11" data-line-number="11"><span class="co">#> [[3]]</span></a> <a class="sourceLine" id="cb16-12" data-line-number="12"><span 
class="co">#> [1] "329-293-8753"</span></a> <a class="sourceLine" id="cb16-13" data-line-number="13"><span class="co">#> </span></a> <a class="sourceLine" id="cb16-14" data-line-number="14"><span class="co">#> [[4]]</span></a> <a class="sourceLine" id="cb16-15" data-line-number="15"><span class="co">#> [1] "579-499-7527" "543.355.3679"</span></a> <a class="sourceLine" id="cb16-16" data-line-number="16"><span class="kw">str_extract_all</span>(strings, phone, <span class="dt">simplify =</span> <span class="ot">TRUE</span>)</a> <a class="sourceLine" id="cb16-17" data-line-number="17"><span class="co">#> [,1] [,2] </span></a> <a class="sourceLine" id="cb16-18" data-line-number="18"><span class="co">#> [1,] "" "" </span></a> <a class="sourceLine" id="cb16-19" data-line-number="19"><span class="co">#> [2,] "219 733 8965" "" </span></a> <a class="sourceLine" id="cb16-20" data-line-number="20"><span class="co">#> [3,] "329-293-8753" "" </span></a> <a class="sourceLine" id="cb16-21" data-line-number="21"><span class="co">#> [4,] "579-499-7527" "543.355.3679"</span></a></code></pre></div></li> <li><p><code>str_match()</code> extracts capture groups formed by <code>()</code> from the <strong>first</strong> match. It returns a character matrix with one column for the complete match and one column for each group. <code>str_match_all()</code> extracts capture groups from all matches and returns a list of character matrices. 
Similar to <code>regmatches()</code>.</p> <div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb17-1" data-line-number="1"><span class="co"># Pull out the three components of the match</span></a> <a class="sourceLine" id="cb17-2" data-line-number="2"><span class="kw">str_match</span>(strings, phone)</a> <a class="sourceLine" id="cb17-3" data-line-number="3"><span class="co">#> [,1] [,2] [,3] [,4] </span></a> <a class="sourceLine" id="cb17-4" data-line-number="4"><span class="co">#> [1,] NA NA NA NA </span></a> <a class="sourceLine" id="cb17-5" data-line-number="5"><span class="co">#> [2,] "219 733 8965" "219" "733" "8965"</span></a> <a class="sourceLine" id="cb17-6" data-line-number="6"><span class="co">#> [3,] "329-293-8753" "329" "293" "8753"</span></a> <a class="sourceLine" id="cb17-7" data-line-number="7"><span class="co">#> [4,] "579-499-7527" "579" "499" "7527"</span></a> <a class="sourceLine" id="cb17-8" data-line-number="8"><span class="kw">str_match_all</span>(strings, phone)</a> <a class="sourceLine" id="cb17-9" data-line-number="9"><span class="co">#> [[1]]</span></a> <a class="sourceLine" id="cb17-10" data-line-number="10"><span class="co">#> [,1] [,2] [,3] [,4]</span></a> <a class="sourceLine" id="cb17-11" data-line-number="11"><span class="co">#> </span></a> <a class="sourceLine" id="cb17-12" data-line-number="12"><span class="co">#> [[2]]</span></a> <a class="sourceLine" id="cb17-13" data-line-number="13"><span class="co">#> [,1] [,2] [,3] [,4] </span></a> <a class="sourceLine" id="cb17-14" data-line-number="14"><span class="co">#> [1,] "219 733 8965" "219" "733" "8965"</span></a> <a class="sourceLine" id="cb17-15" data-line-number="15"><span class="co">#> </span></a> <a class="sourceLine" id="cb17-16" data-line-number="16"><span class="co">#> [[3]]</span></a> <a class="sourceLine" id="cb17-17" data-line-number="17"><span class="co">#> [,1] [,2] [,3] [,4] </span></a> <a class="sourceLine" 
id="cb17-18" data-line-number="18"><span class="co">#> [1,] "329-293-8753" "329" "293" "8753"</span></a> <a class="sourceLine" id="cb17-19" data-line-number="19"><span class="co">#> </span></a> <a class="sourceLine" id="cb17-20" data-line-number="20"><span class="co">#> [[4]]</span></a> <a class="sourceLine" id="cb17-21" data-line-number="21"><span class="co">#> [,1] [,2] [,3] [,4] </span></a> <a class="sourceLine" id="cb17-22" data-line-number="22"><span class="co">#> [1,] "579-499-7527" "579" "499" "7527"</span></a> <a class="sourceLine" id="cb17-23" data-line-number="23"><span class="co">#> [2,] "543.355.3679" "543" "355" "3679"</span></a></code></pre></div></li> <li><p><code>str_replace()</code> replaces the <strong>first</strong> matched pattern and returns a character vector. <code>str_replace_all()</code> replaces all matches. Similar to <code>sub()</code> and <code>gsub()</code>.</p> <div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb18-1" data-line-number="1"><span class="kw">str_replace</span>(strings, phone, <span class="st">"XXX-XXX-XXXX"</span>)</a> <a class="sourceLine" id="cb18-2" data-line-number="2"><span class="co">#> [1] "apple" </span></a> <a class="sourceLine" id="cb18-3" data-line-number="3"><span class="co">#> [2] "XXX-XXX-XXXX" </span></a> <a class="sourceLine" id="cb18-4" data-line-number="4"><span class="co">#> [3] "XXX-XXX-XXXX" </span></a> <a class="sourceLine" id="cb18-5" data-line-number="5"><span class="co">#> [4] "Work: XXX-XXX-XXXX; Home: 543.355.3679"</span></a> <a class="sourceLine" id="cb18-6" data-line-number="6"><span class="kw">str_replace_all</span>(strings, phone, <span class="st">"XXX-XXX-XXXX"</span>)</a> <a class="sourceLine" id="cb18-7" data-line-number="7"><span class="co">#> [1] "apple" </span></a> <a class="sourceLine" id="cb18-8" data-line-number="8"><span class="co">#> [2] "XXX-XXX-XXXX" </span></a> <a class="sourceLine" id="cb18-9" 
data-line-number="9"><span class="co">#> [3] "XXX-XXX-XXXX" </span></a> <a class="sourceLine" id="cb18-10" data-line-number="10"><span class="co">#> [4] "Work: XXX-XXX-XXXX; Home: XXX-XXX-XXXX"</span></a></code></pre></div></li> <li><p><code>str_split_fixed()</code> splits a string into a <strong>fixed</strong> number of pieces based on a pattern and returns a character matrix. <code>str_split()</code> splits a string into a <strong>variable</strong> number of pieces and returns a list of character vectors.</p> <div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb19-1" data-line-number="1"><span class="kw">str_split</span>(<span class="st">"a-b-c"</span>, <span class="st">"-"</span>)</a> <a class="sourceLine" id="cb19-2" data-line-number="2"><span class="co">#> [[1]]</span></a> <a class="sourceLine" id="cb19-3" data-line-number="3"><span class="co">#> [1] "a" "b" "c"</span></a> <a class="sourceLine" id="cb19-4" data-line-number="4"><span class="kw">str_split_fixed</span>(<span class="st">"a-b-c"</span>, <span class="st">"-"</span>, <span class="dt">n =</span> <span class="dv">2</span>)</a> <a class="sourceLine" id="cb19-5" data-line-number="5"><span class="co">#> [,1] [,2] </span></a> <a class="sourceLine" id="cb19-6" data-line-number="6"><span class="co">#> [1,] "a" "b-c"</span></a></code></pre></div></li> </ul> </div> <div id="engines" class="section level3"> <h3>Engines</h3> <p>There are four main engines that stringr can use to describe patterns:</p> <ul> <li><p>Regular expressions, the default, as shown above, and described in <code>vignette("regular-expressions")</code>.</p></li> <li><p>Fixed bytewise matching, with <code>fixed()</code>.</p></li> <li><p>Locale-sensitive character matching, with <code>coll()</code>.</p></li> <li><p>Text boundary analysis, with <code>boundary()</code>.</p></li> </ul> <div id="fixed-matches" class="section level4"> <h4>Fixed matches</h4> <p><code>fixed(x)</code> only 
matches the exact sequence of bytes specified by <code>x</code>. This is a very limited “pattern”, but the restriction can make matching much faster. Be careful when using <code>fixed()</code> with non-English data: the same character can often be represented in more than one way. For example, there are two ways to define “á”: either as a single precomposed character or as an “a” followed by a combining accent:</p> <div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb20-1" data-line-number="1">a1 <-<span class="st"> "\u00e1"</span></a> <a class="sourceLine" id="cb20-2" data-line-number="2">a2 <-<span class="st"> "a\u0301"</span></a> <a class="sourceLine" id="cb20-3" data-line-number="3"><span class="kw">c</span>(a1, a2)</a> <a class="sourceLine" id="cb20-4" data-line-number="4"><span class="co">#> [1] "á" "á"</span></a> <a class="sourceLine" id="cb20-5" data-line-number="5">a1 <span class="op">==</span><span class="st"> </span>a2</a> <a class="sourceLine" id="cb20-6" data-line-number="6"><span class="co">#> [1] FALSE</span></a></code></pre></div> <p>They render identically, but because their underlying byte sequences differ, <code>fixed()</code> doesn’t find a match. 
Instead, you can use <code>coll()</code>, explained below, to respect human character comparison rules:</p> <div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb21-1" data-line-number="1"><span class="kw">str_detect</span>(a1, <span class="kw">fixed</span>(a2))</a> <a class="sourceLine" id="cb21-2" data-line-number="2"><span class="co">#> [1] FALSE</span></a> <a class="sourceLine" id="cb21-3" data-line-number="3"><span class="kw">str_detect</span>(a1, <span class="kw">coll</span>(a2))</a> <a class="sourceLine" id="cb21-4" data-line-number="4"><span class="co">#> [1] TRUE</span></a></code></pre></div> </div> <div id="collation-search" class="section level4"> <h4>Collation search</h4> <p><code>coll(x)</code> looks for a match to <code>x</code> using human-language <strong>coll</strong>ation rules, and is particularly important if you want to do case-insensitive matching. Collation rules differ around the world, so you may also need to supply a <code>locale</code> argument.</p> <div class="sourceCode" id="cb22"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb22-1" data-line-number="1">i <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"I"</span>, <span class="st">"İ"</span>, <span class="st">"i"</span>, <span class="st">"ı"</span>)</a> <a class="sourceLine" id="cb22-2" data-line-number="2">i</a> <a class="sourceLine" id="cb22-3" data-line-number="3"><span class="co">#> [1] "I" "İ" "i" "ı"</span></a> <a class="sourceLine" id="cb22-4" data-line-number="4"></a> <a class="sourceLine" id="cb22-5" data-line-number="5"><span class="kw">str_subset</span>(i, <span class="kw">coll</span>(<span class="st">"i"</span>, <span class="dt">ignore_case =</span> <span class="ot">TRUE</span>))</a> <a class="sourceLine" id="cb22-6" data-line-number="6"><span class="co">#> [1] "I" "i"</span></a> <a class="sourceLine" id="cb22-7" data-line-number="7"><span 
class="kw">str_subset</span>(i, <span class="kw">coll</span>(<span class="st">"i"</span>, <span class="dt">ignore_case =</span> <span class="ot">TRUE</span>, <span class="dt">locale =</span> <span class="st">"tr"</span>))</a> <a class="sourceLine" id="cb22-8" data-line-number="8"><span class="co">#> [1] "İ" "i"</span></a></code></pre></div> <p>The downside of <code>coll()</code> is speed. Because the rules for recognising which characters are the same are complicated, <code>coll()</code> is relatively slow compared to <code>regex()</code> and <code>fixed()</code>. Note that although both <code>fixed()</code> and <code>regex()</code> have <code>ignore_case</code> arguments, they perform a much simpler comparison than <code>coll()</code>.</p> </div> <div id="boundary" class="section level4"> <h4>Boundary</h4> <p><code>boundary()</code> matches boundaries between characters, lines, sentences or words. It’s most useful with <code>str_split()</code>, but can be used with all pattern-matching functions:</p> <div class="sourceCode" id="cb23"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb23-1" data-line-number="1">x <-<span class="st"> "This is a sentence."</span></a> <a class="sourceLine" id="cb23-2" data-line-number="2"><span class="kw">str_split</span>(x, <span class="kw">boundary</span>(<span class="st">"word"</span>))</a> <a class="sourceLine" id="cb23-3" data-line-number="3"><span class="co">#> [[1]]</span></a> <a class="sourceLine" id="cb23-4" data-line-number="4"><span class="co">#> [1] "This" "is" "a" "sentence"</span></a> <a class="sourceLine" id="cb23-5" data-line-number="5"><span class="kw">str_count</span>(x, <span class="kw">boundary</span>(<span class="st">"word"</span>))</a> <a class="sourceLine" id="cb23-6" data-line-number="6"><span class="co">#> [1] 4</span></a> <a class="sourceLine" id="cb23-7" data-line-number="7"><span class="kw">str_extract_all</span>(x, <span class="kw">boundary</span>(<span 
class="st">"word"</span>))</a> <a class="sourceLine" id="cb23-8" data-line-number="8"><span class="co">#> [[1]]</span></a> <a class="sourceLine" id="cb23-9" data-line-number="9"><span class="co">#> [1] "This" "is" "a" "sentence"</span></a></code></pre></div> <p>By convention, <code>""</code> is treated as <code>boundary("character")</code>:</p> <div class="sourceCode" id="cb24"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb24-1" data-line-number="1"><span class="kw">str_split</span>(x, <span class="st">""</span>)</a> <a class="sourceLine" id="cb24-2" data-line-number="2"><span class="co">#> [[1]]</span></a> <a class="sourceLine" id="cb24-3" data-line-number="3"><span class="co">#> [1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "e" "n" "t" "e" "n" "c"</span></a> <a class="sourceLine" id="cb24-4" data-line-number="4"><span class="co">#> [18] "e" "."</span></a> <a class="sourceLine" id="cb24-5" data-line-number="5"><span class="kw">str_count</span>(x, <span class="st">""</span>)</a> <a class="sourceLine" id="cb24-6" data-line-number="6"><span class="co">#> [1] 19</span></a></code></pre></div> </div> </div> </div> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>