EVOLUTION-MANAGER
Edit File: icuSetCollate.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Setup Collation by ICU</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for icuSetCollate {base}"><tr><td>icuSetCollate {base}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2> Setup Collation by ICU </h2> <h3>Description</h3> <p>Controls the way collation is done by ICU (an optional part of the <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> build). </p> <h3>Usage</h3> <pre> icuSetCollate(...) icuGetCollate(type = c("actual", "valid")) </pre> <h3>Arguments</h3> <table summary="R argblock"> <tr valign="top"><td><code>...</code></td> <td> <p>Named arguments, see ‘Details’.</p> </td></tr> <tr valign="top"><td><code>type</code></td> <td> <p>character string: can be abbreviated. Either the actual locale in use for collation or the most specific locale which would be valid.</p> </td></tr> </table> <h3>Details</h3> <p>Optionally, <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> can be built to collate character strings by ICU (<a href="http://site.icu-project.org">http://site.icu-project.org</a>). For such systems, <code>icuSetCollate</code> can be used to tune the way collation is done. On other builds calling this function does nothing, with a warning. </p> <p>Possible arguments are </p> <dl> <dt><code>locale</code>:</dt><dd><p>A character string such as <code>"da_DK"</code> giving the language and country whose collation rules are to be used. If present, this should be the first argument.</p> </dd> <dt><code>case_first</code>:</dt><dd><p><code>"upper"</code>, <code>"lower"</code> or <code>"default"</code>, asking for upper- or lower-case characters to be sorted first. The default is usually lower-case first, but not in all languages (not under the default settings for Danish, for example).</p> </dd> <dt><code>alternate_handling</code>:</dt><dd><p>Controls the handling of ‘variable’ characters (mainly punctuation and symbols). Possible values are <code>"non_ignorable"</code> (primary strength) and <code>"shifted"</code> (quaternary strength).</p> </dd> <dt><code>strength</code>:</dt><dd><p>Which components should be used? Possible values <code>"primary"</code>, <code>"secondary"</code>, <code>"tertiary"</code> (default), <code>"quaternary"</code> and <code>"identical"</code>. </p> </dd> <dt><code>french_collation</code>:</dt><dd><p>In a French locale the way accents affect collation is from right to left, whereas in most other locales it is from left to right. Possible values <code>"on"</code>, <code>"off"</code> and <code>"default"</code>.</p> </dd> <dt><code>normalization</code>:</dt><dd><p>Should strings be normalized? Possible values are <code>"on"</code> and <code>"off"</code> (default). This affects the collation of composite characters.</p> </dd> <dt><code>case_level</code>:</dt><dd><p>An additional level between secondary and tertiary, used to distinguish large and small Japanese Kana characters. Possible values <code>"on"</code> and <code>"off"</code> (default).</p> </dd> <dt><code>hiragana_quaternary</code>:</dt><dd><p>Possible values <code>"on"</code> (sort Hiragana first at quaternary level) and <code>"off"</code>.</p> </dd> </dl> <p>Only the first three are likely to be of interest except to those with a detailed understanding of collation and specialized requirements. </p> <p>Some special values are accepted for <code>locale</code>: </p> <dl> <dt><code>"none"</code>:</dt><dd><p>ICU is not used for collation: the OS's collation services are used instead.</p> </dd> <dt><code>"ASCII"</code>:</dt><dd><p>ICU is not used for collation: the C function <code>strcmp</code> is used instead, which should sort byte-by-byte in (unsigned) numerical order.</p> </dd> <dt><code>"default"</code>:</dt><dd> <p>obtains the locale from the OS as is done at the start of the session. If environment variable <span class="env">R_ICU_LOCALE</span> is set to a non-empty value, its value is used rather than consulting the OS, unless environment variable <span class="env">LC_ALL</span> is set to 'C' (or unset but <span class="env">LC_COLLATE</span> is set to 'C'). </p> </dd> <dt><code>""</code>, <code>"root"</code>:</dt><dd> <p>the ‘root’ collation: see <a href="http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation">http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation</a>. </p> </dd> </dl> <p>For the specifications of ‘real’ ICU locales, see <a href="http://userguide.icu-project.org/locale">http://userguide.icu-project.org/locale</a>. Note that ICU does not report that a locale is not supported, but falls back to its idea of ‘best fit’ (which could be rather different and is reported by <code>icuGetCollate("actual")</code>, often <code>"root"</code>). Most English locales fall back to <code>"root"</code> as although e.g. <code>"en_GB"</code> is a valid locale (at least on some platforms), it contains no special rules for collation. Note that <code>"C"</code> is not a supported ICU locale and hence <span class="env">R_ICU_LOCALE</span> should never be set to <code>"C"</code>. </p> <p>Some examples are <code>case_level = "on", strength = "primary"</code> to ignore accent differences and <code>alternate_handling = "shifted"</code> to ignore space and punctuation characters. </p> <p>Initially ICU will not be used for collation if the OS is set to use the <code>C</code> locale for collation and <span class="env">R_ICU_LOCALE</span> is not set. Once this function is called with a value for <code>locale</code>, ICU will be used until it is called again with <code>locale = "none"</code>. ICU will not be used once <code>Sys.setlocale</code> is called with a <code>"C"</code> value for <code>LC_ALL</code> or <code>LC_COLLATE</code>, even if <span class="env">R_ICU_LOCALE</span> is set. ICU will be used again honoring <span class="env">R_ICU_LOCALE</span> once <code>Sys.setlocale</code> is called to set a different collation order. Environment variables <span class="env">LC_ALL</span> (or <span class="env">LC_COLLATE</span>) take precedence over <span class="env">R_ICU_LOCALE</span> if and only if they are set to 'C'. Due to the interaction with other ways of setting the collation order, <span class="env">R_ICU_LOCALE</span> should be used with care and only when needed. </p> <p>All customizations are reset to the default for the locale if <code>locale</code> is specified: the collation engine is reset if the OS collation locate category is changed by <code><a href="locales.html">Sys.setlocale</a></code>. </p> <h3>Value</h3> <p>For <code>icuGetCollate</code>, a character string describing the ICU locale in use (which may be reported as <code>"ICU not in use"</code>). The ‘actual’ locale may be simpler than the requested locale: for example <code>"da"</code> rather than <code>"da_DK"</code>: English locales are likely to report <code>"root"</code>. </p> <h3>Note</h3> <p>ICU is used by default wherever it is available: this include macOS, Solaris and many Linux installations. As it works internally in UTF-8, it will be most efficient in UTF-8 locales. </p> <p>It is optional on Windows: if <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> has been built against ICU, it will only be used if environment variable <span class="env">R_ICU_LOCALE</span> is set or once <code>icuSetCollate</code> is called to select the locale (as ICU and Windows differ in their idea of locale names). Note that <code>icuSetCollate(locale = "default")</code> should work reasonably well for <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> >= 3.2.0 and Windows Vista/Server 2008 and later (but finds the system default ignoring environment variables such as <span class="env">LC_COLLATE</span>). </p> <h3>See Also</h3> <p><a href="Comparison.html">Comparison</a>, <code><a href="sort.html">sort</a></code>. </p> <p><code><a href="capabilities.html">capabilities</a></code> for whether ICU is available; <code><a href="extSoftVersion.html">extSoftVersion</a></code> for its version. </p> <p>The ICU user guide chapter on collation (<a href="http://userguide.icu-project.org/collation">http://userguide.icu-project.org/collation</a>). </p> <h3>Examples</h3> <pre> ## These examples depend on having ICU available, and on the locale. ## As we don't know the current settings, we can only reset to the default. if(capabilities("ICU")) { print(icuGetCollate()) print(icuGetCollate("valid")) x <- c("Aarhus", "aarhus", "safe", "test", "Zoo") print(sort(x)) icuSetCollate(case_first = "upper"); print(sort(x)) icuSetCollate(case_first = "lower"); print(sort(x)) ## Danish collates upper-case-first and with 'aa' as a single letter icuSetCollate(locale = "da_DK", case_first = "default"); print(sort(x)) ## Estonian collates Z between S and T icuSetCollate(locale = "et_EE"); print(sort(x)) icuSetCollate(locale = "default"); print(icuGetCollate("valid")) } </pre> <hr /><div style="text-align: center;">[Package <em>base</em> version 3.6.0 <a href="00Index.html">Index</a>]</div> </body></html>