<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Detect Character Set and Language</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for stri_enc_detect {stringi}"><tr><td>stri_enc_detect {stringi}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2>Detect Character Set and Language</h2> <h3>Description</h3> <p>This function uses the <span class="pkg">ICU</span> engine to determine the character set, or encoding, of character data in an unknown format. </p> <h3>Usage</h3> <pre> stri_enc_detect(str, filter_angle_brackets = FALSE) </pre> <h3>Arguments</h3> <table summary="R argblock"> <tr valign="top"><td><code>str</code></td> <td> <p>a character vector, a raw vector, or a list of <code>raw</code> vectors</p> </td></tr> <tr valign="top"><td><code>filter_angle_brackets</code></td> <td> <p>logical; if filtering is enabled, text within angle brackets ("&lt;" and "&gt;") is removed before detection, which strips most HTML or XML markup.</p> </td></tr> </table> <h3>Details</h3> <p>Vectorized over <code>str</code> and <code>filter_angle_brackets</code>. </p> <p>Detection is, at best, an imprecise operation that relies on statistics and heuristics. It therefore works best if you supply at least a few hundred bytes of character data that is mostly in a single language. However, because the detection only looks at a limited amount of the input data, some of the returned character sets may fail to handle all of the input data. Note that in some cases, the language can be determined along with the encoding. </p> <p>Several different techniques are used for character set detection. For multi-byte encodings, the sequence of bytes is checked for legal patterns. 
The detected characters are also checked against a list of frequently used characters in that encoding. For single-byte encodings, the data is checked against a list of the most commonly occurring three-letter groups for each language that can be written using that encoding. </p> <p>The detection process can be configured to optionally ignore HTML- or XML-style markup (using <span class="pkg">ICU</span>'s internal facilities), which can interfere with detection by changing the statistics. </p> <p>This function should most often be used for byte-marked input strings, especially after loading them from text files and before the main conversion with <code><a href="stri_encode.html">stri_encode</a></code>. The declared encoding of the input, if any, is ignored. </p> <p>The following table shows all the encodings that can be detected: </p> <table summary="Rd table"> <tr> <td style="text-align: left;"> <strong>Character Set</strong> </td><td style="text-align: left;"> <strong>Languages</strong></td> </tr> <tr> <td style="text-align: left;"> UTF-8 </td><td style="text-align: left;"> -- </td> </tr> <tr> <td style="text-align: left;"> UTF-16BE </td><td style="text-align: left;"> -- </td> </tr> <tr> <td style="text-align: left;"> UTF-16LE </td><td style="text-align: left;"> -- </td> </tr> <tr> <td style="text-align: left;"> UTF-32BE </td><td style="text-align: left;"> -- </td> </tr> <tr> <td style="text-align: left;"> UTF-32LE </td><td style="text-align: left;"> -- </td> </tr> <tr> <td style="text-align: left;"> Shift_JIS </td><td style="text-align: left;"> Japanese </td> </tr> <tr> <td style="text-align: left;"> ISO-2022-JP </td><td style="text-align: left;"> Japanese </td> </tr> <tr> <td style="text-align: left;"> ISO-2022-CN </td><td style="text-align: left;"> Simplified Chinese </td> </tr> <tr> <td style="text-align: left;"> ISO-2022-KR </td><td style="text-align: left;"> Korean </td> </tr> <tr> <td style="text-align: left;"> GB18030 
</td><td style="text-align: left;"> Chinese </td> </tr> <tr> <td style="text-align: left;"> Big5 </td><td style="text-align: left;"> Traditional Chinese </td> </tr> <tr> <td style="text-align: left;"> EUC-JP </td><td style="text-align: left;"> Japanese </td> </tr> <tr> <td style="text-align: left;"> EUC-KR </td><td style="text-align: left;"> Korean </td> </tr> <tr> <td style="text-align: left;"> ISO-8859-1 </td><td style="text-align: left;"> Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish </td> </tr> <tr> <td style="text-align: left;"> ISO-8859-2 </td><td style="text-align: left;"> Czech, Hungarian, Polish, Romanian </td> </tr> <tr> <td style="text-align: left;"> ISO-8859-5 </td><td style="text-align: left;"> Russian </td> </tr> <tr> <td style="text-align: left;"> ISO-8859-6 </td><td style="text-align: left;"> Arabic </td> </tr> <tr> <td style="text-align: left;"> ISO-8859-7 </td><td style="text-align: left;"> Greek </td> </tr> <tr> <td style="text-align: left;"> ISO-8859-8 </td><td style="text-align: left;"> Hebrew </td> </tr> <tr> <td style="text-align: left;"> ISO-8859-9 </td><td style="text-align: left;"> Turkish </td> </tr> <tr> <td style="text-align: left;"> windows-1250 </td><td style="text-align: left;"> Czech, Hungarian, Polish, Romanian </td> </tr> <tr> <td style="text-align: left;"> windows-1251 </td><td style="text-align: left;"> Russian </td> </tr> <tr> <td style="text-align: left;"> windows-1252 </td><td style="text-align: left;"> Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish </td> </tr> <tr> <td style="text-align: left;"> windows-1253 </td><td style="text-align: left;"> Greek </td> </tr> <tr> <td style="text-align: left;"> windows-1254 </td><td style="text-align: left;"> Turkish </td> </tr> <tr> <td style="text-align: left;"> windows-1255 </td><td style="text-align: left;"> Hebrew </td> </tr> <tr> <td style="text-align: left;"> windows-1256 </td><td style="text-align: left;"> Arabic 
</td> </tr> <tr> <td style="text-align: left;"> KOI8-R </td><td style="text-align: left;"> Russian </td> </tr> <tr> <td style="text-align: left;"> IBM420 </td><td style="text-align: left;"> Arabic </td> </tr> <tr> <td style="text-align: left;"> IBM424 </td><td style="text-align: left;"> Hebrew </td> </tr> </table> <p>If you have an initial guess at the language and encoding, try <code><a href="stri_enc_detect2.html">stri_enc_detect2</a></code>. </p> <h3>Value</h3> <p>Returns a list of the same length as <code>str</code>. Each list element is a data frame with the following three columns, which together describe all the guesses: </p> <ul> <li> <p><code>Encoding</code> – string; guessed encodings; <code>NA</code> on failure, </p> </li> <li> <p><code>Language</code> – string; guessed languages; <code>NA</code> if the language could not be determined (e.g., in case of UTF-8), </p> </li> <li> <p><code>Confidence</code> – numeric in [0,1]; the higher the value, the greater the confidence in the match; <code>NA</code> on failure. </p> </li></ul> <p>The guesses are ordered by decreasing confidence. 
</p> <h3>References</h3> <p><em>Character Set Detection</em> – ICU User Guide, <a href="http://userguide.icu-project.org/conversion/detection">http://userguide.icu-project.org/conversion/detection</a> </p> <h3>See Also</h3> <p>Other encoding_detection: <code><a href="stri_enc_detect2.html">stri_enc_detect2</a>()</code>, <code><a href="stri_enc_isascii.html">stri_enc_isascii</a>()</code>, <code><a href="stri_enc_isutf16.html">stri_enc_isutf16be</a>()</code>, <code><a href="stri_enc_isutf8.html">stri_enc_isutf8</a>()</code>, <code><a href="stringi-encoding.html">stringi-encoding</a></code> </p> <h3>Examples</h3> <pre>
## Not run: 
f &lt;- rawToChar(readBin("test.txt", "raw", 100000))
stri_enc_detect(f)

## End(Not run)
</pre> <hr /><div style="text-align: center;">[Package <em>stringi</em> version 1.4.6 <a href="00Index.html">Index</a>]</div> </body></html>