<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Word stem tokenizer</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
</head><body>

<table width="100%" summary="page for tokenize_word_stems {tokenizers}"><tr><td>tokenize_word_stems {tokenizers}</td><td style="text-align: right;">R Documentation</td></tr></table>

<h2>Word stem tokenizer</h2>

<h3>Description</h3>

<p>This function turns its input into a character vector of word stems. It is a wrapper around the <code><a href="../../SnowballC/html/wordStem.html">wordStem</a></code> function from the SnowballC package, which does the heavy lifting, while providing an interface consistent with the rest of the tokenizers in this package. The input can be a character vector of any length, or a list of character vectors where each character vector in the list has a length of 1.
</p>

<h3>Usage</h3>

<pre>
tokenize_word_stems(
  x,
  language = "english",
  stopwords = NULL,
  simplify = FALSE
)
</pre>

<h3>Arguments</h3>

<table summary="R argblock">
<tr valign="top"><td><code>x</code></td>
<td>
<p>A character vector or a list of character vectors to be tokenized. If <code>x</code> is a character vector, it can be of any length, and each element will be tokenized separately. If <code>x</code> is a list of character vectors, each element of the list should have a length of 1.</p>
</td></tr>
<tr valign="top"><td><code>language</code></td>
<td>
<p>The language to use for word stemming. This must be one of the languages available in the SnowballC package.
A list is provided by <code><a href="../../SnowballC/html/getStemLanguages.html">getStemLanguages</a></code>.</p>
</td></tr>
<tr valign="top"><td><code>stopwords</code></td>
<td>
<p>A character vector of stop words to be excluded.</p>
</td></tr>
<tr valign="top"><td><code>simplify</code></td>
<td>
<p><code>FALSE</code> by default so that a consistent value is returned regardless of the length of the input. If <code>TRUE</code>, then an input with a single element will return a character vector of tokens instead of a list.</p>
</td></tr>
</table>

<h3>Details</h3>

<p>This function will strip all white space and punctuation and make all word stems lowercase.
</p>

<h3>Value</h3>

<p>A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If <code>simplify = TRUE</code> and only a single element was passed as input, then the output is a character vector of tokens.
</p>

<h3>See Also</h3>

<p><code><a href="../../SnowballC/html/wordStem.html">wordStem</a></code>
</p>

<h3>Examples</h3>

<pre>
song &lt;- paste0("How many roads must a man walk down\n",
               "Before you call him a man?\n",
               "How many seas must a white dove sail\n",
               "Before she sleeps in the sand?\n",
               "\n",
               "How many times must the cannonballs fly\n",
               "Before they're forever banned?\n",
               "The answer, my friend, is blowin' in the wind.\n",
               "The answer is blowin' in the wind.\n")

tokenize_word_stems(song)
</pre>

<hr /><div style="text-align: center;">[Package <em>tokenizers</em> version 0.2.3 <a href="00Index.html">Index</a>]</div>
</body></html>