<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Enhanced data.frame</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for data.table-package {data.table}"><tr><td>data.table-package {data.table}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2> Enhanced data.frame </h2> <h3>Description</h3> <p><code>data.table</code> <em>inherits</em> from <code>data.frame</code>. It offers fast and memory efficient: file reader and writer, aggregations, updates, equi, non-equi, rolling, range and interval joins, in a short and flexible syntax, for faster development. </p> <p>It is inspired by <code>A[B]</code> syntax in <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> where <code>A</code> is a matrix and <code>B</code> is a 2-column matrix. Since a <code>data.table</code> <em>is</em> a <code>data.frame</code>, it is compatible with <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> functions and packages that accept <em>only</em> <code>data.frame</code>s. </p> <p>Type <code>vignette(package="data.table")</code> to get started. The <a href="../doc/datatable-intro.html">Introduction to data.table</a> vignette introduces <code>data.table</code>'s <code>x[i, j, by]</code> syntax and is a good place to start. If you have read the vignettes and the help page below, please read the <a href="https://github.com/Rdatatable/data.table/wiki/Support">data.table support guide</a>. </p> <p>Please check the <a href="https://github.com/Rdatatable/data.table/wiki">homepage</a> for up to the minute live NEWS. </p> <p>Tip: one of the <em>quickest</em> ways to learn the features is to type <code>example(data.table)</code> and study the output at the prompt. 
</p> <h3>Usage</h3> <pre>
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFactors=FALSE)

## S3 method for class 'data.table'
x[i, j, by, keyby, with = TRUE,
  nomatch = getOption("datatable.nomatch", NA),
  mult = "all",
  roll = FALSE,
  rollends = if (roll=="nearest") c(TRUE,TRUE)
             else if (roll>=0) c(FALSE,TRUE)
             else c(TRUE,FALSE),
  which = FALSE,
  .SDcols,
  verbose = getOption("datatable.verbose"),                   # default: FALSE
  allow.cartesian = getOption("datatable.allow.cartesian"),   # default: FALSE
  drop = NULL, on = NULL]
</pre> <h3>Arguments</h3> <table summary="R argblock"> <tr valign="top"><td><code>...</code></td> <td> <p> Just as <code>...</code> in <code><a href="../../base/html/data.frame.html">data.frame</a></code>. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.</p> </td></tr> <tr valign="top"><td><code>keep.rownames</code></td> <td> <p> If <code>...</code> is a <code>matrix</code> or <code>data.frame</code>, <code>TRUE</code> will retain the rownames of that object in a column named <code>rn</code>.</p> </td></tr> <tr valign="top"><td><code>check.names</code></td> <td> <p> Just as <code>check.names</code> in <code><a href="../../base/html/data.frame.html">data.frame</a></code>.</p> </td></tr> <tr valign="top"><td><code>key</code></td> <td> <p> Character vector of one or more column names which is passed to <code><a href="setkey.html">setkey</a></code>. It may be a single comma separated string such as <code>key="x,y,z"</code>, or a vector of names such as <code>key=c("x","y","z")</code>.</p> </td></tr> <tr valign="top"><td><code>stringsAsFactors</code></td> <td> <p>Logical (default is <code>FALSE</code>). 
Convert all <code>character</code> columns to <code>factor</code>s?</p> </td></tr> <tr valign="top"><td><code>x</code></td> <td> <p> A <code>data.table</code>.</p> </td></tr> <tr valign="top"><td><code>i</code></td> <td> <p> Integer, logical or character vector, single column numeric <code>matrix</code>, expression of column names, <code>list</code>, <code>data.frame</code> or <code>data.table</code>. </p> <p><code>integer</code> and <code>logical</code> vectors work the same way they do in <code><a href="../../base/html/Extract.data.frame.html">[.data.frame</a></code> except logical <code>NA</code>s are treated as FALSE. </p> <p><code>expression</code> is evaluated within the frame of the <code>data.table</code> (i.e. it sees column names as if they are variables) and can evaluate to any of the other types. </p> <p><code>character</code>, <code>list</code> and <code>data.frame</code> input to <code>i</code> is converted into a <code>data.table</code> internally using <code><a href="as.data.table.html">as.data.table</a></code>. </p> <p>If <code>i</code> is a <code>data.table</code>, the columns in <code>i</code> to be matched against <code>x</code> can be specified using one of these ways: </p> <ul> <li><p><code>on</code> argument (see below). It allows for both <code>equi-</code> and the newly implemented <code>non-equi</code> joins. </p> </li> <li><p>If not, <code>x</code> <em>must be keyed</em>. Key can be set using <code><a href="setkey.html">setkey</a></code>. If <code>i</code> is also keyed, then first <em>key</em> column of <code>i</code> is matched against first <em>key</em> column of <code>x</code>, second against second, etc.. </p> <p>If <code>i</code> is not keyed, then first column of <code>i</code> is matched against first <em>key</em> column of <code>x</code>, second column of <code>i</code> against second <em>key</em> column of <code>x</code>, etc... 
</p> <p>This is summarised in code as <code>min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i))</code>. </p> </li></ul> <p>Using <code>on=</code> is recommended (even during keyed joins) as it helps understand the code better and also allows for <em>non-equi</em> joins. </p> <p>When the binary operator <code>==</code> alone is used, an <em>equi</em> join is performed. In SQL terms, <code>x[i]</code> then performs a <em>right join</em> by default. <code>i</code> prefixed with <code>!</code> signals a <em>not-join</em> or <em>not-select</em>. </p> <p>Support for <em>non-equi</em> join was recently implemented, which allows for other binary operators <code>>=, >, <= and <</code>. </p> <p>See <a href="../doc/datatable-keys-fast-subset.html"><code>vignette("datatable-keys-fast-subset")</code></a> and <a href="../doc/datatable-secondary-indices-and-auto-indexing.html"><code>vignette("datatable-secondary-indices-and-auto-indexing")</code></a>. </p> <p><em>Advanced:</em> When <code>i</code> is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope. </p> </td></tr> <tr valign="top"><td><code>j</code></td> <td> <p>When <code>with=TRUE</code> (default), <code>j</code> is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just <em>select</em> columns in <code>j</code>, but also <code>compute</code> on them e.g., <code>x[, a]</code> and <code>x[, sum(a)]</code> returns <code>x$a</code> and <code>sum(x$a)</code> as a vector respectively. <code>x[, .(a, b)]</code> and <code>x[, .(sa=sum(a), sb=sum(b))]</code> returns a two column data.table each, the first simply <em>selecting</em> columns <code>a, b</code> and the second <em>computing</em> their sums. </p> <p>As long as <code>j</code> returns a <code>list</code>, each element of the list becomes a column in the resulting <code>data.table</code>. 
When the output of <code>j</code> is not a <code>list</code>, the output is returned as-is (e.g. <code>x[ , a]</code> returns the column vector <code>a</code>), unless <code>by</code> is used, in which case it is implicitly wrapped in <code>list</code> for convenience (e.g. <code>x[ , sum(a), by=b]</code> will create a column named <code>V1</code> with value <code>sum(a)</code> for each group). </p> <p>The expression <code>.()</code> is a <em>shorthand</em> alias to <code>list()</code>; they both mean the same. (An exception is made for the use of <code>.()</code> within a call to <code><a href="../../base/html/bquote.html">bquote</a></code>, where <code>.()</code> is left unchanged.) </p> <p>When <code>j</code> is a vector of column names or positions to select (as in <code>data.frame</code>), there is no need to use <code>with=FALSE</code> anymore. Note that <code>with=FALSE</code> is still necessary when using a logical vector with length <code>ncol(x)</code> to include/exclude columns. Note: if a logical vector with length <code>k < ncol(x)</code> is passed, it will be filled to length <code>ncol(x)</code> with <code>FALSE</code>, which is different from <code>data.frame</code>, where the vector is recycled. </p> <p><em>Advanced:</em> <code>j</code> also allows the use of special <em>read-only</em> symbols: <code><a href="special-symbols.html">.SD</a></code>, <code><a href="special-symbols.html">.N</a></code>, <code><a href="special-symbols.html">.I</a></code>, <code><a href="special-symbols.html">.GRP</a></code>, <code><a href="special-symbols.html">.BY</a></code>. See <code><a href="special-symbols.html">special-symbols</a></code> and the Examples below for more. </p> <p><em>Advanced:</em> When <code>i</code> is a <code>data.table</code>, the columns of <code>i</code> can be referred to in <code>j</code> by using the prefix <code>i.</code>, e.g., <code>X[Y, .(val, i.val)]</code>. 
Here <code>val</code> refers to <code>X</code>'s column and <code>i.val</code> <code>Y</code>'s. </p> <p><em>Advanced:</em> Columns of <code>x</code> can now be referred to using the prefix <code>x.</code> and is particularly useful during joining to refer to <code>x</code>'s <em>join</em> columns as they are otherwise masked by <code>i</code>'s. For example, <code>X[Y, .(x.a-i.a, b), on="a"]</code>. </p> <p>See <a href="../doc/datatable-intro.html"><code>vignette("datatable-intro")</code></a> and <code>example(data.table)</code>.</p> </td></tr> <tr valign="top"><td><code>by</code></td> <td> <p> Column names are seen as if they are variables (as in <code>j</code> when <code>with=TRUE</code>). The <code>data.table</code> is then grouped by the <code>by</code> and <code>j</code> is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. <code>by</code> accepts: </p> <ul> <li><p>A single unquoted column name: e.g., <code>DT[, .(sa=sum(a)), by=x]</code> </p> </li> <li><p>a <code>list()</code> of expressions of column names: e.g., <code>DT[, .(sa=sum(a)), by=.(x=x>0, y)]</code> </p> </li> <li><p>a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., <code>DT[, sum(a), by="x,y,z"]</code> </p> </li> <li><p>a character vector of column names: e.g., <code>DT[, sum(a), by=c("x", "y")]</code> </p> </li> <li><p>or of the form <code>startcol:endcol</code>: e.g., <code>DT[, sum(a), by=x:z]</code> </p> </li></ul> <p><em>Advanced:</em> When <code>i</code> is a <code>list</code> (or <code>data.frame</code> or <code>data.table</code>), <code>DT[i, j, by=.EACHI]</code> evaluates <code>j</code> for the groups in 'DT' that each row in <code>i</code> joins to. That is, you can join (in <code>i</code>) and aggregate (in <code>j</code>) simultaneously. We call this <em>grouping by each i</em>. 
See <a href="https://stackoverflow.com/a/27004566/559784">this StackOverflow answer</a> for a more detailed explanation until we <a href="https://github.com/Rdatatable/data.table/issues/944">roll out vignettes</a>. </p> <p><em>Advanced:</em> In the <code>X[Y, j]</code> form of grouping, the <code>j</code> expression sees variables in <code>X</code> first, then <code>Y</code>. We call this <em>join inherited scope</em>. If the variable is not in <code>X</code> or <code>Y</code> then the calling frame is searched, then its calling frame, and so on in the usual way up to and including the global environment.</p> </td></tr> <tr valign="top"><td><code>keyby</code></td> <td> <p> Same as <code>by</code>, but with an additional <code>setkey()</code> run on the <code>by</code> columns of the result, for convenience. It is common practice to use <code>keyby=</code> routinely when you wish the result to be sorted.</p> </td></tr> <tr valign="top"><td><code>with</code></td> <td> <p> By default <code>with=TRUE</code> and <code>j</code> is evaluated within the frame of <code>x</code>; column names can be used as variables. If a variable name exists both as a column of the dataset and in the parent scope, use the double dot prefix <code>..cols</code> to refer explicitly to the <code>cols</code> variable in the parent scope rather than to the column of the same name. </p> <p>When <code>j</code> is a character vector of column names, a numeric vector of column positions to select, or of the form <code>startcol:endcol</code>, the value returned is always a <code>data.table</code>. <code>with=FALSE</code> is not necessary anymore to select columns dynamically. Note that <code>x[, cols]</code> is equivalent to <code>x[, ..cols]</code> and to <code>x[, cols, with=FALSE]</code> and to <code>x[, .SD, .SDcols=cols]</code>.</p> </td></tr> <tr valign="top"><td><code>nomatch</code></td> <td> <p> When a row in <code>i</code> has no match to <code>x</code>, <code>nomatch=NA</code> (default) means <code>NA</code> is returned. 
<code>NULL</code> (or <code>0</code> for backward compatibility) means no rows will be returned for that row of <code>i</code>. Use <code>options(datatable.nomatch=NULL)</code> to change the default value (used when <code>nomatch</code> is not supplied).</p> </td></tr> <tr valign="top"><td><code>mult</code></td> <td> <p> When <code>i</code> is a <code>list</code> (or <code>data.frame</code> or <code>data.table</code>) and <em>multiple</em> rows in <code>x</code> match to the row in <code>i</code>, <code>mult</code> controls which are returned: <code>"all"</code> (default), <code>"first"</code> or <code>"last"</code>.</p> </td></tr> <tr valign="top"><td><code>roll</code></td> <td> <p> When <code>i</code> is a <code>data.table</code> and its row matches to all but the last <code>x</code> join column, and its value in the last <code>i</code> join column falls in a gap (including after the last observation in <code>x</code> for that group), then: </p> <ul> <li><p><code>+Inf</code> (or <code>TRUE</code>) rolls the <em>prevailing</em> value in <code>x</code> forward. It is also known as last observation carried forward (LOCF). </p> </li> <li><p><code>-Inf</code> rolls backwards instead; i.e., next observation carried backward (NOCB). </p> </li> <li><p>finite positive or negative number limits how far values are carried forward or backward. </p> </li> <li><p>"nearest" rolls the nearest value instead. </p> </li></ul> <p>Rolling joins apply to the last join column, generally a date but can be any variable. It is particularly fast using a modified binary search. 
</p> <p>A common idiom is to select a contemporaneous regular time series (<code>dts</code>) across a set of identifiers (<code>ids</code>): <code>DT[CJ(ids,dts),roll=TRUE]</code> where <code>DT</code> has a 2-column key (id,date) and <code><a href="J.html">CJ</a></code> stands for <em>cross join</em>.</p> </td></tr> <tr valign="top"><td><code>rollends</code></td> <td> <p> A logical vector length 2 (a single logical is recycled) indicating whether values falling before the first value or after the last value for a group should be rolled as well. </p> <ul> <li><p>If <code>rollends[2]=TRUE</code>, it will roll the last value forward. <code>TRUE</code> by default for LOCF and <code>FALSE</code> for NOCB rolls. </p> </li> <li><p>If <code>rollends[1]=TRUE</code>, it will roll the first value backward. <code>TRUE</code> by default for NOCB and <code>FALSE</code> for LOCF rolls. </p> </li></ul> <p>When <code>roll</code> is a finite number, that limit is also applied when rolling the ends.</p> </td></tr> <tr valign="top"><td><code>which</code></td> <td> <p><code>TRUE</code> returns the row numbers of <code>x</code> that <code>i</code> matches to. If <code>NA</code>, returns the row numbers of <code>i</code> that have no match in <code>x</code>. By default <code>FALSE</code> and the rows in <code>x</code> that match are returned.</p> </td></tr> <tr valign="top"><td><code>.SDcols</code></td> <td> <p> Specifies the columns of <code>x</code> to be included in the special symbol <code><a href="special-symbols.html">.SD</a></code> which stands for <code>Subset of data.table</code>. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possible very many) columns; e.g., <code>DT[, lapply(.SD, sum), by="x,y", .SDcols=301:350]</code>. </p> <p>For convenient interactive use, the form <code>startcol:endcol</code> is also allowed (as in <code>by</code>), e.g., <code>DT[, lapply(.SD, sum), by=x:y, .SDcols=a:f]</code>. 
</p> <p>Inversion (column dropping instead of keeping) can be accomplished by prepending the argument with <code>!</code> or <code>-</code> (there's no difference between these), e.g. <code>.SDcols = !c('x', 'y')</code>. </p> <p>Finally, you can filter columns to include in <code>.SD</code> based on their <em>names</em> according to regular expressions via <code>.SDcols=patterns(regex1, regex2, ...)</code>. The included columns will be the <em>intersection</em> of the columns identified by each pattern; pattern unions can easily be specified with <code>|</code> in a regex. You can filter columns on their <em>values</em> by passing a function, e.g. <code>.SDcols=<a href="../../base/html/numeric.html">is.numeric</a></code>. You can also invert a pattern as usual with <code>.SDcols=!patterns(...)</code> or <code>.SDcols=!is.numeric</code>. </p> </td></tr> <tr valign="top"><td><code>verbose</code></td> <td> <p><code>TRUE</code> turns on status and information messages to the console. Turn this on by default using <code>options(datatable.verbose=TRUE)</code>. The quantity and types of verbosity may be expanded in future. </p> </td></tr> <tr valign="top"><td><code>allow.cartesian</code></td> <td> <p><code>FALSE</code> prevents joins that would result in more than <code>nrow(x)+nrow(i)</code> rows. This is usually caused by duplicate values in <code>i</code>'s join columns, each of which joins to the same group in <code>x</code> over and over again: a <em>misspecified</em> join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in <code>data.table</code>: where every row in <code>i</code> joins to every row in <code>x</code> (a <code>nrow(x)*nrow(i)</code> row result). 'cartesian' is just meant in a 'large multiplicative' sense, so <code>FALSE</code> does not always prevent a traditional cartesian join. 
</p> </td></tr> <tr valign="top"><td><code>drop</code></td> <td> <p> Never used by <code>data.table</code>. Do not use. It needs to be here because <code>data.table</code> inherits from <code>data.frame</code>. See <a href="../doc/datatable-faq.html"><code>vignette("datatable-faq")</code></a>.</p> </td></tr> <tr valign="top"><td><code>on</code></td> <td> <p> Indicate which columns in <code>x</code> should be joined with which columns in <code>i</code> along with the type of binary operator to join with (see non-equi joins below on this). When specified, this overrides the keys set on <code>x</code> and <code>i</code>. When the <code>.NATURAL</code> keyword is provided, a <em>natural join</em> is made (a join on the columns common to both tables). There are multiple ways of specifying the <code>on</code> argument: </p> <ul> <li><p>As an unnamed character vector, e.g., <code>X[Y, on=c("a", "b")]</code>, used when columns <code>a</code> and <code>b</code> are common to both <code>X</code> and <code>Y</code>. </p> </li> <li><p><em>Foreign key joins</em>: As a <em>named</em> character vector when the join columns have different names in <code>X</code> and <code>Y</code>. For example, <code>X[Y, on=c(x1="y1", x2="y2")]</code> joins <code>X</code> and <code>Y</code> by matching columns <code>x1</code> and <code>x2</code> in <code>X</code> with columns <code>y1</code> and <code>y2</code> in <code>Y</code>, respectively. </p> <p>From v1.9.8, you can also express foreign key joins using the binary operator <code>==</code>, e.g. <code>X[Y, on=c("x1==y1", "x2==y2")]</code>. </p> <p>NB: shorthand like <code>X[Y, on=c("a", V2="b")]</code> is also possible if, e.g., column <code>"a"</code> is common between the two tables. </p> </li> <li><p>For convenience during interactive scenarios, it is also possible to use <code>.()</code> syntax as <code>X[Y, on=.(a, b)]</code>. 
</p> </li> <li><p>From v1.9.8, (non-equi) joins using binary operators <code>>=, >, <=, <</code> are also possible, e.g., <code>X[Y, on=c("x>=a", "y<=b")]</code>, or for interactive use as <code>X[Y, on=.(x>=a, y<=b)]</code>. </p> </li></ul> <p>See examples as well as <a href="../doc/datatable-secondary-indices-and-auto-indexing.html"><code>vignette("datatable-secondary-indices-and-auto-indexing")</code></a>. </p> </td></tr> </table> <h3>Details</h3> <p><code>data.table</code> builds on base <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> functionality to reduce 2 types of time:<br /> </p> <ol> <li><p>programming time (easier to write, read, debug and maintain), and </p> </li> <li><p>compute time (fast and memory efficient). </p> </li></ol> <p>The general form of data.table syntax is:<br /> </p> <pre>
    DT[ i,  j,  by ] # + extra arguments
        |   |   |
        |   |    -------> grouped by what?
        |    -------> what to do?
         ---> on which rows?
</pre> <p>The way to read this out loud is: "Take <code>DT</code>, subset rows by <code>i</code>, <em>then</em> compute <code>j</code> grouped by <code>by</code>". Here are some basic usage examples expanding on this definition. See the vignette (and examples) for working examples. </p> <pre>
    X[, a]                      # return col 'a' from X as vector. If not found, search in parent frame.
    X[, .(a)]                   # same as above, but return as a data.table.
    X[, sum(a)]                 # return sum(a) as a vector (with same scoping rules as above)
    X[, .(sum(a)), by=c]        # get sum(a) grouped by 'c'.
    X[, sum(a), by=c]           # same as above, .() can be omitted in j and by on single expression for convenience
    X[, sum(a), by=c:f]         # get sum(a) grouped by all columns in between 'c' and 'f' (both inclusive)

    X[, sum(a), keyby=b]        # get sum(a) grouped by 'b', and sort that result by the grouping column 'b'
    X[, sum(a), by=b][order(b)] # same order as above, but by chaining compound expressions
    X[c>1, sum(a), by=c]        # get rows where c>1 is TRUE, and on those rows, get sum(a) grouped by 'c'
    X[Y, .(a, b), on="c"]       # get rows where Y$c == X$c, and select columns 'X$a' and 'X$b' for those rows
    X[Y, .(a, i.a), on="c"]     # get rows where Y$c == X$c, and then select 'X$a' and 'Y$a' (=i.a)
    X[Y, sum(a*i.a), on="c", by=.EACHI] # for *each* 'Y$c', get sum(a*i.a) on matching rows in 'X$c'

    X[, plot(a, b), by=c]       # j accepts any expression, generates plot for each group and returns no data
    # see ?assign to add/update/delete columns by reference using the same consistent interface
</pre> <p>A <code>data.table</code> is a <code>list</code> of vectors, just like a <code>data.frame</code>. However: </p> <ol> <li><p> it never has or uses rownames. Rownames based indexing can be done by setting a <em>key</em> of one or more columns or done <em>ad-hoc</em> using the <code>on</code> argument (now preferred). </p> </li> <li><p> it has enhanced functionality in <code>[.data.table</code> for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all. </p> </li></ol> <p>See the <code>see also</code> section for the several other <em>methods</em> that are available for operating on data.tables efficiently. </p> <h3>Note</h3> <p> If <code>keep.rownames</code> or <code>check.names</code> are supplied they must be written in full because <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> does not allow partial argument names after '<code>...</code>'. 
For example, <code>data.table(DF, keep=TRUE)</code> will create a column called <code>"keep"</code> containing <code>TRUE</code> and this is correct behaviour; <code>data.table(DF, keep.rownames=TRUE)</code> was intended. </p> <p><code>POSIXlt</code> is not supported as a column type because it uses 40 bytes to store a single datetime. They are implicitly converted to <code>POSIXct</code> type with <em>warning</em>. You may also be interested in <code><a href="IDateTime.html">IDateTime</a></code> instead; it has methods to convert to and from <code>POSIXlt</code>. </p> <h3>References</h3> <p><a href="https://github.com/Rdatatable/data.table/wiki">https://github.com/Rdatatable/data.table/wiki</a> (<code>data.table</code> homepage)<br /> <a href="https://en.wikipedia.org/wiki/Binary_search">https://en.wikipedia.org/wiki/Binary_search</a> </p> <h3>See Also</h3> <p><code><a href="special-symbols.html">special-symbols</a></code>, <code><a href="../../base/html/data.frame.html">data.frame</a></code>, <code><a href="../../base/html/Extract.data.frame.html">[.data.frame</a></code>, <code><a href="as.data.table.html">as.data.table</a></code>, <code><a href="setkey.html">setkey</a></code>, <code><a href="setorder.html">setorder</a></code>, <code><a href="setDT.html">setDT</a></code>, <code><a href="setDF.html">setDF</a></code>, <code><a href="J.html">J</a></code>, <code><a href="J.html">SJ</a></code>, <code><a href="J.html">CJ</a></code>, <code><a href="merge.html">merge.data.table</a></code>, <code><a href="tables.html">tables</a></code>, <code><a href="test.data.table.html">test.data.table</a></code>, <code><a href="IDateTime.html">IDateTime</a></code>, <code><a href="duplicated.html">unique.data.table</a></code>, <code><a href="copy.html">copy</a></code>, <code><a href="assign.html">:=</a></code>, <code><a href="truelength.html">setalloccol</a></code>, <code><a href="truelength.html">truelength</a></code>, <code><a href="rbindlist.html">rbindlist</a></code>, <code><a 
href="setNumericRounding.html">setNumericRounding</a></code>, <code><a href="datatable-optimize.html">datatable-optimize</a></code>, <code><a href="setops.html">fsetdiff</a></code>, <code><a href="setops.html">funion</a></code>, <code><a href="setops.html">fintersect</a></code>, <code><a href="setops.html">fsetequal</a></code>, <code><a href="duplicated.html">anyDuplicated</a></code>, <code><a href="duplicated.html">uniqueN</a></code>, <code><a href="rowid.html">rowid</a></code>, <code><a href="rleid.html">rleid</a></code>, <code><a href="na.omit.data.table.html">na.omit</a></code>, <code><a href="frank.html">frank</a></code> </p> <h3>Examples</h3> <pre>
## Not run: 
example(data.table)  # to run these examples yourself

## End(Not run)
DF = data.frame(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT), dim(DF))    # TRUE
identical(DF$a, DT$a)          # TRUE
is.list(DF)                    # TRUE
is.list(DT)                    # TRUE

is.data.frame(DT)              # TRUE

tables()

# basic row subset operations
DT[2]                          # 2nd row
DT[3:2]                        # 3rd and 2nd row
DT[order(x)]                   # no need for order(DT$x)
DT[order(x), ]                 # same as above. The ',' is optional
DT[y>2]                        # all rows where DT$y > 2
DT[y>2 & v>5]                  # compound logical expressions
DT[!2:4]                       # all rows other than 2:4
DT[-(2:4)]                     # same

# select|compute columns data.table way
DT[, v]                        # v column (as vector)
DT[, list(v)]                  # v column (as data.table)
DT[, .(v)]                     # same as above, .() is a shorthand alias to list()
DT[, sum(v)]                   # sum of column v, returned as vector
DT[, .(sum(v))]                # same, but return data.table (column autonamed V1)
DT[, .(sv=sum(v))]             # same, but column named "sv"
DT[, .(v, v*2)]                # return two column data.table, v and v*2

# subset rows and select|compute data.table way
DT[2:3, sum(v)]                # sum(v) over rows 2 and 3, return vector
DT[2:3, .(sum(v))]             # same, but return data.table with column V1
DT[2:3, .(sv=sum(v))]          # same, but return data.table with column sv
DT[2:5, cat(v, "\n")]          # just for j's side effect

# select columns the data.frame way
DT[, 2]                        # 2nd column, returns a data.table always
colNum = 2                     # to refer vars in `j` from the outside of data use `..` prefix
DT[, ..colNum]                 # same, equivalent to DT[, .SD, .SDcols=colNum]
DT[["v"]]                      # same as DT[, v] but much faster

# grouping operations - j and by
DT[, sum(v), by=x]             # ad hoc by, order of groups preserved in result
DT[, sum(v), keyby=x]          # same, but order the result on by cols
DT[, sum(v), by=x][order(x)]   # same but by chaining expressions together

# fast ad hoc row subsets (subsets as joins)
DT["a", on="x"]                # same as x == "a" but uses binary search (fast)
DT["a", on=.(x)]               # same, for convenience, no need to quote every column
DT[.("a"), on="x"]             # same
DT[x=="a"]                     # same, single "==" internally optimised to use binary search (fast)
DT[x!="b" | y!=3]              # not yet optimized, currently vector scan subset
DT[.("b", 3), on=c("x", "y")]  # join on columns x,y of DT; uses binary search (fast)
DT[.("b", 3), on=.(x, y)]      # same, but using on=.()
DT[.("b", 1:2), on=c("x", "y")]             # no match returns NA
DT[.("b", 1:2), on=.(x, y), nomatch=NULL]   # no match row is not returned
DT[.("b", 1:2), on=c("x", "y"),
roll=Inf]                                   # locf, nomatch row gets rolled by previous row
DT[.("b", 1:2), on=.(x, y), roll=-Inf]      # nocb, nomatch row gets rolled by next row
DT["b", sum(v*y), on="x"]                   # on rows where DT$x=="b", calculate sum(v*y)

# all together now
DT[x!="a", sum(v), by=x]                    # get sum(v) by "x" for each i != "a"
DT[!"a", sum(v), by=.EACHI, on="x"]         # same, but using subsets-as-joins
DT[c("b","c"), sum(v), by=.EACHI, on="x"]   # same
DT[c("b","c"), sum(v), by=.EACHI, on=.(x)]  # same, using on=.()

# joins as subsets
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X

DT[X, on="x"]                         # right join
X[DT, on="x"]                         # left join
DT[X, on="x", nomatch=NULL]           # inner join
DT[!X, on="x"]                        # not join
DT[X, on=c(y="v")]                    # join using column "y" of DT with column "v" of X
DT[X, on="y==v"]                      # same as above (v1.9.8+)

DT[X, on=.(y<=foo)]                   # NEW non-equi join (v1.9.8+)
DT[X, on="y<=foo"]                    # same as above
DT[X, on=c("y<=foo")]                 # same as above
DT[X, on=.(y>=foo)]                   # NEW non-equi join (v1.9.8+)
DT[X, on=.(x, y<=foo)]                # NEW non-equi join (v1.9.8+)
DT[X, .(x,y,x.y,v), on=.(x, y>=foo)]  # Select x's join columns as well

DT[X, on="x", mult="first"]           # first row of each group
DT[X, on="x", mult="last"]            # last row of each group
DT[X, sum(v), by=.EACHI, on="x"]      # join and eval j for each row in i
DT[X, sum(v)*foo, by=.EACHI, on="x"]  # join inherited scope
DT[X, sum(v)*i.v, by=.EACHI, on="x"]  # 'i.v' refers to X's v column
DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # NEW non-equi join with by=.EACHI (v1.9.8+)

# setting keys
kDT = copy(DT)                        # (deep) copy DT to kDT to work with it.
setkey(kDT,x)                         # set a 1-column key. No quotes, for convenience.
setkeyv(kDT,"x")                      # same (v in setkeyv stands for vector)
v="x"
setkeyv(kDT,v)                        # same
# key(kDT)<-"x"                       # copies whole table, please use set* functions instead
haskey(kDT)                           # TRUE
key(kDT)                              # "x"

# fast *keyed* subsets
kDT["a"]                              # subset-as-join on *key* column 'x'
kDT["a", on="x"]                      # same, being explicit using 'on=' (preferred)

# all together
kDT[!"a", sum(v), by=.EACHI]          # get sum(v) for each i != "a"

# multi-column key
setkey(kDT,x,y)                       # 2-column key
setkeyv(kDT,c("x","y"))               # same

# fast *keyed* subsets on multi-column key
kDT["a"]                              # join to 1st column of key
kDT["a", on="x"]                      # on= is optional, but is preferred
kDT[.("a")]                           # same, .() is an alias for list()
kDT[list("a")]                        # same
kDT[.("a", 3)]                        # join to 2 columns
kDT[.("a", 3:6)]                      # join 4 rows (2 missing)
kDT[.("a", 3:6), nomatch=NULL]        # remove missing
kDT[.("a", 3:6), roll=TRUE]           # locf rolling join
kDT[.("a", 3:6), roll=Inf]            # same as above
kDT[.("a", 3:6), roll=-Inf]           # nocb rolling join
kDT[!.("a")]                          # not join
kDT[!"a"]                             # same

# more on special symbols, see also ?"special-symbols"
DT[.N]                                # last row
DT[, .N]                              # total number of rows in DT
DT[, .N, by=x]                        # number of rows in each group
DT[, .SD, .SDcols=x:y]                # select columns 'x' through 'y'
DT[ , .SD, .SDcols = !x:y]            # drop columns 'x' through 'y'
DT[ , .SD, .SDcols = patterns('^[xv]')] # select columns matching '^x' or '^v'
DT[, .SD[1]]                          # first row of all columns
DT[, .SD[1], by=x]                    # first row of 'y' and 'v' for each group in 'x'
DT[, c(.N, lapply(.SD, sum)), by=x]   # get rows *and* sum columns 'v' and 'y' by group
DT[, .I[1], by=x]                     # row number in DT corresponding to each group
DT[, grp := .GRP, by=x]               # add a group counter column
DT[ , dput(.BY), by=.(x,y)]           # .BY is a list of singletons for each group
X[, DT[.BY, y, on="x"], by=x]         # join within each group
DT[, {
  # write each group to a different file
  fwrite(.SD, file.path(tempdir(), paste0('x=', .BY$x, '.csv')))
}, by=x]
dir(tempdir())

# add/update/delete by reference (see ?assign)
print(DT[, z:=42L])                   # add new column by reference
print(DT[, z:=NULL])                  # remove column by reference
print(DT["a", v:=42L, on="x"])        # subassign to existing v column by reference
print(DT["b", v2:=84L, on="x"])       # subassign to new column by reference (NA padded)

DT[, m:=mean(v), by=x][]              # add new column by reference by group
                                      # NB: postfix [] is shortcut to print()

# advanced usage
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2),
                y=c(1,3,6), a=1:9, b=9:1)

DT[, sum(v), by=.(y%%2)]              # expressions in by
DT[, sum(v), by=.(bool = y%%2)]       # same, using a named list to change by column name
DT[, .SD[2], by=x]                    # get 2nd row of each group
DT[, tail(.SD,2), by=x]               # last 2 rows of each group
DT[, lapply(.SD, sum), by=x]          # sum of all (other) columns for each group
DT[, .SD[which.min(v)], by=x]         # nested query by group

DT[, list(MySum=sum(v),
          MyMin=min(v),
          MyMax=max(v)),
    by=.(x, y%%2)]                    # by 2 expressions

DT[, .(a = .(a), b = .(b)), by=x]     # list columns
DT[, .(seq = min(a):max(b)), by=x]    # j is not limited to just aggregations
DT[, sum(v), by=x][V1<20]             # compound query
DT[, sum(v), by=x][order(-V1)]        # ordering results
DT[, c(.N, lapply(.SD,sum)), by=x]    # get number of observations and sum per group
DT[, {tmp <- mean(y);
      .(a = a-tmp, b = b-tmp)
      }, by=x]                        # anonymous lambda in 'j', j accepts any valid
                                      # expression. TO REMEMBER: every element of
                                      # the list becomes a column in result.
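
# Illustrative sketch (not part of the original examples) of the 'roll', 'rollends'
# and 'which' arguments described under Arguments. The tables 'prices' and 'asks'
# and their columns are hypothetical names chosen only for this sketch.
prices = data.table(id="a", t=c(1L, 5L, 9L), p=c(10, 11, 12))
asks   = data.table(id="a", t=c(0L, 3L, 7L, 20L))
prices[asks, on=.(id, t), roll=TRUE]   # locf; t=0 precedes the first observation, so p is NA
                                       # there (rollends[1] is FALSE by default for locf)
prices[asks, on=.(id, t), roll=-Inf]   # nocb; t=20 follows the last observation, so p is NA there
prices[asks, on=.(id, t), roll=3]      # locf limited to 3 units of t; t=20 is too far to roll
prices[asks, on=.(id, t), roll=TRUE, rollends=TRUE]  # also roll the first value backward to t=0
prices[asks, on=.(id, t), which=TRUE]  # row numbers of 'prices' matched exactly (no rolling)
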
pdf("new.pdf")
DT[, plot(a,b), by=x]                 # can also plot in 'j'
dev.off()

# using rleid, get max(y) and min of all cols in .SDcols for each consecutive run of 'v'
DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]

# Support guide and links:
# https://github.com/Rdatatable/data.table/wiki/Support

## Not run: 
if (interactive()) {
  vignette(package="data.table")  # 9 vignettes

  test.data.table()               # 6,000 tests

  # keep up to date with latest stable version on CRAN
  update.packages()

  # get the latest devel version that has passed all tests
  update_dev_pkg()
  # read more at:
  # https://github.com/Rdatatable/data.table/wiki/Installation
}

## End(Not run)</pre> <hr /><div style="text-align: center;">[Package <em>data.table</em> version 1.14.4 <a href="00Index.html">Index</a>]</div> </body></html>