<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Penn Treebank Tokenizer</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
</head><body>

<table width="100%" summary="page for tokenize_ptb {tokenizers}"><tr><td>tokenize_ptb {tokenizers}</td><td style="text-align: right;">R Documentation</td></tr></table>

<h2>Penn Treebank Tokenizer</h2>

<h3>Description</h3>

<p>This function implements the Penn Treebank word tokenizer.
</p>


<h3>Usage</h3>

<pre>
tokenize_ptb(x, lowercase = FALSE, simplify = FALSE)
</pre>


<h3>Arguments</h3>

<table summary="R argblock">
<tr valign="top"><td><code>x</code></td>
<td>
<p>A character vector or a list of character vectors to be tokenized into words. If <code>x</code> is a character vector, it can be of any length, and each element will be tokenized separately. If <code>x</code> is a list of character vectors, each element of the list should have a length of 1.</p>
</td></tr>
<tr valign="top"><td><code>lowercase</code></td>
<td>
<p>Should the tokens be made lower case?</p>
</td></tr>
<tr valign="top"><td><code>simplify</code></td>
<td>
<p><code>FALSE</code> by default so that a consistent value is returned regardless of the length of the input. If <code>TRUE</code>, then an input with a single element will return a character vector of tokens instead of a list.</p>
</td></tr>
</table>


<h3>Details</h3>

<p>This tokenizer uses regular expressions to tokenize text in a manner similar to
the tokenization used in the Penn Treebank. It assumes that text has already
been split into sentences. The tokenizer does the following:
</p>

<ul>
<li><p>splits common English contractions, e.g. <code style="white-space: pre;">don't</code> is tokenized into <code style="white-space: pre;">do n't</code> and <code style="white-space: pre;">they'll</code> is tokenized into <code style="white-space: pre;">they 'll</code>,
</p>
</li>
<li><p>treats punctuation characters as separate tokens,
</p>
</li>
<li><p>splits commas and single quotes off from words when they are followed by whitespace,
</p>
</li>
<li><p>splits off periods that occur at the end of the sentence.
</p>
</li></ul>

<p>This function is a port of the Python NLTK version of the Penn Treebank
tokenizer.
</p>


<h3>Value</h3>

<p>A list of character vectors containing the tokens, with one element in the
list for each element that was passed as input. If <code>simplify = TRUE</code> and
only a single element was passed as input, then the output is a character
vector of tokens.
</p>


<h3>References</h3>

<p><a href="https://www.nltk.org/_modules/nltk/tokenize/treebank.html#TreebankWordTokenizer">NLTK TreebankWordTokenizer</a>
</p>


<h3>Examples</h3>

<pre>
song <- list(
  paste0(
    "How many roads must a man walk down\n",
    "Before you call him a man?"
  ),
  paste0(
    "How many seas must a white dove sail\n",
    "Before she sleeps in the sand?\n"
  ),
  paste0(
    "How many times must the cannonballs fly\n",
    "Before they're forever banned?\n"
  ),
  "The answer, my friend, is blowin' in the wind.",
  "The answer is blowin' in the wind."
)
tokenize_ptb(song)
tokenize_ptb(c(
  "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.",
  "They'll save and invest more.",
  "Hi, I can't say hello."
))
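
# A hedged, illustrative sketch (not part of the original documentation): the
# call below exercises the splitting rules listed in Details. The tokens shown
# in the comment are the expected result and may differ slightly depending on
# the installed version of the tokenizers package.
tokenize_ptb("Don't stop, they'll wait.")
# Expected (one list element): "Do" "n't" "stop" "," "they" "'ll" "wait" "."
</pre>

<hr /><div style="text-align: center;">[Package <em>tokenizers</em> version 0.2.3 <a href="00Index.html">Index</a>]</div>
</body></html>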