<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Hunspell Spell Checking and Morphological Analysis</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for hunspell {hunspell}"><tr><td>hunspell {hunspell}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2>Hunspell Spell Checking and Morphological Analysis</h2> <h3>Description</h3> <p>The <code><a href="hunspell.html">hunspell</a></code> function is a high-level wrapper for finding spelling errors within a text document. It takes a character vector with text (in <code>text</code>, <code>latex</code>, <code>man</code>, <code>html</code> or <code>xml</code> format), parses out the words and returns a list of incorrect words for each line. It effectively combines <code><a href="hunspell.html">hunspell_parse</a></code> with <code><a href="hunspell.html">hunspell_check</a></code> in a single step. Other functions in the package operate on individual words; see details.
</p> <h3>Usage</h3> <pre>
hunspell(
  text,
  format = c("text", "man", "latex", "html", "xml"),
  dict = dictionary("en_US"),
  ignore = en_stats
)

hunspell_parse(
  text,
  format = c("text", "man", "latex", "html", "xml"),
  dict = dictionary("en_US")
)

hunspell_check(words, dict = dictionary("en_US"))

hunspell_suggest(words, dict = dictionary("en_US"))

hunspell_analyze(words, dict = dictionary("en_US"))

hunspell_stem(words, dict = dictionary("en_US"))

hunspell_info(dict = dictionary("en_US"))

dictionary(lang = "en_US", affix = NULL, add_words = NULL, cache = TRUE)

list_dictionaries()
</pre> <h3>Arguments</h3> <table summary="R argblock"> <tr valign="top"><td><code>text</code></td> <td> <p>character vector with arbitrary input text</p> </td></tr> <tr valign="top"><td><code>format</code></td> <td> <p>input format; supported parsers are <code>text</code>, <code>latex</code>, <code>man</code>, <code>xml</code> and <code>html</code>.</p> </td></tr> <tr valign="top"><td><code>dict</code></td> <td> <p>a dictionary object or string which can be passed to <code><a href="hunspell.html">dictionary</a></code>.</p> </td></tr> <tr valign="top"><td><code>ignore</code></td> <td> <p>character vector with additional approved words added to the dictionary</p> </td></tr> <tr valign="top"><td><code>words</code></td> <td> <p>character vector with individual words to spell check</p> </td></tr> <tr valign="top"><td><code>lang</code></td> <td> <p>dictionary file or language, see details</p> </td></tr> <tr valign="top"><td><code>affix</code></td> <td> <p>file path to the corresponding affix file.
If <code>NULL</code> it is assumed to be the same path as <code>dict</code> with extension <code>.aff</code>.</p> </td></tr> <tr valign="top"><td><code>add_words</code></td> <td> <p>a character vector of additional words to add to the dictionary</p> </td></tr> <tr valign="top"><td><code>cache</code></td> <td> <p>speed up loading of dictionaries by caching</p> </td></tr> </table> <h3>Details</h3> <p>Hunspell uses a special dictionary format that defines which stems and affixes are valid in a given language. The <code><a href="hunspell.html">hunspell_analyze</a></code> function shows how a word breaks down into a valid stem plus affix. The <code><a href="hunspell.html">hunspell_stem</a></code> function is similar but only returns valid stems for a given word. Stemming can be used to summarize text (e.g. in a wordcloud). The <code><a href="hunspell.html">hunspell_check</a></code> function takes a vector of individual words and tests each one for correctness. Finally, <code><a href="hunspell.html">hunspell_suggest</a></code> is used to suggest correct alternatives for each (incorrect) input word. </p> <p>Because spell checking is usually done on a document, the package includes some parsers to extract words from various common formats. With <code><a href="hunspell.html">hunspell_parse</a></code> we can parse plain text, latex and man format. R also has a few built-in parsers such as <code><a href="../../tools/html/RdTextFilter.html">RdTextFilter</a></code> and <code><a href="../../tools/html/SweaveTeXFilter.html">SweaveTeXFilter</a></code>; see also <code><a href="../../utils/html/aspell.html">?aspell</a></code>. </p> <p>The package searches for dictionaries in the working directory as well as in the standard system locations. <code><a href="hunspell.html">list_dictionaries</a></code> provides a list of all dictionaries it can find. Additional search paths can be specified by setting the <code>DICPATH</code> environment variable.
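</p>

<p>For instance, the search path and active dictionary can be inspected as follows. The extra directory below is only an example path, and depending on when the package reads <code>DICPATH</code>, the variable may need to be set before the package is loaded:</p>

<pre>
# Show all dictionaries found on the current search path
list_dictionaries()

# Add a custom directory to the search path (example path only)
Sys.setenv(DICPATH = "~/my_dictionaries")
list_dictionaries()

# Show metadata (dictionary file, affix file, encoding) for a dictionary
hunspell_info(dictionary("en_US"))
</pre>

<p>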
A US English dictionary (<code>en_US</code>) is included with the package; other dictionaries need to be installed on the system. Most operating systems already include compatible dictionaries with names such as <a href="https://packages.debian.org/sid/hunspell-en-gb">hunspell-en-gb</a> or <a href="https://packages.debian.org/sid/myspell-en-gb">myspell-en-gb</a>. </p> <p>To manually install dictionaries, copy the corresponding <code>.aff</code> and <code>.dic</code> files to <code>~/Library/Spelling</code> or a custom directory specified in <code>DICPATH</code>. Alternatively, you can pass the entire path to the <code>.dic</code> file as the <code>dict</code> parameter. Some popular sources of dictionaries are <a href="http://wordlist.aspell.net/dicts/">SCOWL</a>, <a href="http://openoffice.cs.utah.edu/contrib/dictionaries/">OpenOffice</a>, <a href="http://archive.ubuntu.com/ubuntu/pool/main/libr/libreoffice-dictionaries/?C=S;O=D">Debian</a>, <a href="https://github.com/titoBouzout/Dictionaries">github/titoBouzout</a> or <a href="https://github.com/wooorm/dictionaries">github/wooorm</a>. </p> <p>Note that <code>hunspell</code> uses <code><a href="../../base/html/iconv.html">iconv</a></code> to convert input text to the encoding used by the dictionary. This will fail if <code>text</code> contains characters which are unsupported by that particular encoding. For this reason, UTF-8 dictionaries are preferable over legacy 8-bit dictionaries.
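</p>

<p>For example, once a British English dictionary is installed it can be loaded either by name or by the full path to its <code>.dic</code> file; the path below is only an example, so adjust it to wherever your <code>.dic</code>/<code>.aff</code> pair actually lives. Extra approved words can also be added via <code>add_words</code>:</p>

<pre>
# Load a manually installed dictionary by full path (example path only)
gb <- dictionary("~/Library/Spelling/en_GB.dic")
hunspell_check(c("colour", "color"), dict = gb)

# Extend the bundled en_US dictionary with extra approved words
custom <- dictionary("en_US", add_words = c("hunspell", "wordcloud"))
hunspell_check(c("hunspell", "wordcloud"), dict = custom)
</pre>

<p>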
</p> <h3>Examples</h3> <pre>
# Check individual words
words <- c("beer", "wiskey", "wine")
correct <- hunspell_check(words)
print(correct)

# Find suggestions for incorrect words
hunspell_suggest(words[!correct])

# Extract incorrect words from a piece of text
bad <- hunspell("spell checkers are not neccessairy for langauge ninja's")
print(bad[[1]])
hunspell_suggest(bad[[1]])

# Stemming
words <- c("love", "loving", "lovingly", "loved", "lover", "lovely", "love")
hunspell_stem(words)
hunspell_analyze(words)

# Check an entire latex document
tmpfile <- file.path(tempdir(), "1406.4806v1.tar.gz")
download.file("https://arxiv.org/e-print/1406.4806v1", tmpfile, mode = "wb")
untar(tmpfile, exdir = tempdir())
text <- readLines(file.path(tempdir(), "content.tex"), warn = FALSE)
bad_words <- hunspell(text, format = "latex")
sort(unique(unlist(bad_words)))

# Summarize text by stems (e.g. for a wordcloud)
allwords <- hunspell_parse(text, format = "latex")
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- head(sort(table(stems), decreasing = TRUE), 200)
</pre> <hr /><div style="text-align: center;">[Package <em>hunspell</em> version 3.0.2 <a href="00Index.html">Index</a>]</div> </body></html>