<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Check if a Character Vector is Validly Encoded</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
</head><body>

<table width="100%" summary="page for validUTF8 {base}"><tr><td>validUTF8 {base}</td><td style="text-align: right;">R Documentation</td></tr></table>

<h2>Check if a Character Vector is Validly Encoded</h2>

<h3>Description</h3>

<p>Check if each element of a character vector is valid in its implied encoding.
</p>

<h3>Usage</h3>

<pre>
validUTF8(x)
validEnc(x)
</pre>

<h3>Arguments</h3>

<table summary="R argblock">
<tr valign="top"><td><code>x</code></td>
<td>
<p>a character vector.</p>
</td></tr>
</table>

<h3>Details</h3>

<p>These use similar checks to those used by functions such as
<code><a href="grep.html">grep</a></code>.
</p>
<p><code>validUTF8</code> ignores any marked encoding (see
<code><a href="Encoding.html">Encoding</a></code>) and so checks directly whether the bytes in each
string are valid UTF-8.
</p>
<p><code>validEnc</code> regards a character string as validly encoded unless
it is marked as UTF-8, or it is unmarked and the <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> session is in a UTF-8 or other multi-byte locale; only in those cases are the bytes actually checked. (The checks in other multi-byte locales depend on the OS and, as with <code><a href="iconv.html">iconv</a></code>, not all invalid inputs may be detected.)
</p>

<h3>Value</h3>

<p>A logical vector of the same length as <code>x</code>. <code>NA</code> elements
are regarded as validly encoded.
</p>

<h3>Note</h3>

<p>It would be possible to check for the validity of character strings in
a Latin-1 encoding, but extensions such as CP1252 are widely accepted
as ‘Latin-1’ and 8-bit encodings rarely need to be checked for
validity.
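</p>

<p>As a minimal sketch of the distinction drawn under Details (the strings below are illustrative and are not taken from the original examples): <code>validUTF8</code> inspects the raw bytes whatever the marked encoding, whereas <code>validEnc</code> accepts a string marked as Latin-1 without checking it.
</p>

<pre>
## Illustrative input: valid UTF-8 bytes, a lone Latin-1 byte, and NA
x <- c("caf\xc3\xa9", "caf\xe9", NA)
Encoding(x[1]) <- "UTF-8"   # mark the first element as UTF-8
Encoding(x[2]) <- "latin1"  # mark the second element as Latin-1

validUTF8(x)  # marks are ignored: the lone \xe9 byte is not valid UTF-8
validEnc(x)   # the Latin-1 mark is taken on trust; NA counts as valid
</pre>

<p>Under the rules stated in Details and Value, <code>validUTF8(x)</code> should give <code>TRUE FALSE TRUE</code> and <code>validEnc(x)</code> should give <code>TRUE TRUE TRUE</code>, independently of the session locale, because every non-<code>NA</code> element carries an explicit encoding mark.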
</p>

<h3>Examples</h3>

<pre>
x <-
  ## from example(text)
  c("Jetz", "no", "chli", "z\xc3\xbcrit\xc3\xbc\xc3\xbctsch:",
    "(noch", "ein", "bi\xc3\x9fchen", "Z\xc3\xbc", "deutsch)",
    ## from a CRAN check log
    "\xfa\xb4\xbf\xbf\x9f")
validUTF8(x)
validEnc(x) # depends on the locale
Encoding(x) <-"UTF-8"
validEnc(x) # typically the last, x[10], is invalid

## Maybe advantageous to declare it "unknown":
G <- x ; Encoding(G[!validEnc(G)]) <- "unknown"
try( substr(x, 1,1) ) # gives 'invalid multibyte string' error
try( substr(G, 1,1) ) # works
nchar(G) # fine, too
## but it is not "more valid" typically:
all.equal(validEnc(x), validEnc(G)) # typically TRUE
</pre>

<hr /><div style="text-align: center;">[Package <em>base</em> version 3.6.0 <a href="00Index.html">Index</a>]</div>
</body></html>