<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Character shingle tokenizers</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
</head><body>

<table width="100%" summary="page for tokenize_character_shingles {tokenizers}"><tr><td>tokenize_character_shingles {tokenizers}</td><td style="text-align: right;">R Documentation</td></tr></table>

<h2>Character shingle tokenizers</h2>

<h3>Description</h3>

<p>The character shingle tokenizer functions like an n-gram tokenizer, except the units that are shingled are characters instead of words. Options to the function let you determine whether non-alphanumeric characters like punctuation should be retained or discarded.
</p>

<h3>Usage</h3>

<pre>
tokenize_character_shingles(
  x,
  n = 3L,
  n_min = n,
  lowercase = TRUE,
  strip_non_alphanum = TRUE,
  simplify = FALSE
)
</pre>

<h3>Arguments</h3>

<table summary="R argblock">
<tr valign="top"><td><code>x</code></td>
<td>
<p>A character vector or a list of character vectors to be tokenized into character shingles. If <code>x</code> is a character vector, it can be of any length, and each element will be tokenized separately. If <code>x</code> is a list of character vectors, each element of the list should have a length of 1.</p>
</td></tr>
<tr valign="top"><td><code>n</code></td>
<td>
<p>The number of characters in each shingle.
This must be an integer greater than or equal to 1.</p>
</td></tr>
<tr valign="top"><td><code>n_min</code></td>
<td>
<p>The minimum number of characters in each shingle. This must be an integer greater than or equal to 1, and less than or equal to <code>n</code>.</p>
</td></tr>
<tr valign="top"><td><code>lowercase</code></td>
<td>
<p>Should the characters be made lower case?</p>
</td></tr>
<tr valign="top"><td><code>strip_non_alphanum</code></td>
<td>
<p>Should punctuation and white space be stripped?</p>
</td></tr>
<tr valign="top"><td><code>simplify</code></td>
<td>
<p><code>FALSE</code> by default so that a consistent value is returned regardless of the length of the input. If <code>TRUE</code>, then an input with a single element will return a character vector of tokens instead of a list.</p>
</td></tr>
</table>

<h3>Value</h3>

<p>A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If <code>simplify = TRUE</code> and only a single element was passed as input, then the output is a character vector of tokens.
</p>

<h3>Examples</h3>

<pre>
x &lt;- c("Now is the hour of our discontent")
tokenize_character_shingles(x)
tokenize_character_shingles(x, n = 5)
tokenize_character_shingles(x, n = 5, strip_non_alphanum = FALSE)
tokenize_character_shingles(x, n = 5, n_min = 3, strip_non_alphanum = FALSE)
</pre>

<hr /><div style="text-align: center;">[Package <em>tokenizers</em> version 0.2.3 <a href="00Index.html">Index</a>]</div>
</body></html>