<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Chunk text into smaller segments</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for chunk_text {tokenizers}"><tr><td>chunk_text {tokenizers}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2>Chunk text into smaller segments</h2> <h3>Description</h3> <p>Given a text or vector/list of texts, break the texts into smaller segments each with the same number of words. This allows you to treat a very long document, such as a novel, as a set of smaller documents. </p> <h3>Usage</h3> <pre> chunk_text(x, chunk_size = 100, doc_id = names(x), ...) </pre> <h3>Arguments</h3> <table summary="R argblock"> <tr valign="top"><td><code>x</code></td> <td> <p>A character vector or a list of character vectors to be chunked. If <code>x</code> is a character vector, it can be of any length, and each element will be chunked separately. If <code>x</code> is a list of character vectors, each element of the list should have a length of 1.</p> </td></tr> <tr valign="top"><td><code>chunk_size</code></td> <td> <p>The number of words in each chunk.</p> </td></tr> <tr valign="top"><td><code>doc_id</code></td> <td> <p>The document IDs as a character vector. This will be taken from the names of the <code>x</code> vector if available. 
<code>NULL</code> is acceptable.</p> </td></tr> <tr valign="top"><td><code>...</code></td> <td> <p>Arguments passed on to <code><a href="basic-tokenizers.html">tokenize_words</a></code>.</p> </td></tr> </table> <h3>Details</h3> <p>Chunking the text passes it through <code><a href="basic-tokenizers.html">tokenize_words</a></code>, which will strip punctuation and lowercase the text unless you provide arguments to pass along to that function. </p> <h3>Examples</h3> <pre>
## Not run: 
chunked &lt;- chunk_text(mobydick, chunk_size = 100)
length(chunked)
chunked[1:3]

## End(Not run)
</pre> <hr /><div style="text-align: center;">[Package <em>tokenizers</em> version 0.2.3 <a href="00Index.html">Index</a>]</div> </body></html>