EVOLUTION-MANAGER
Edit File: duplicated.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Determine Duplicate Rows</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for duplicated {data.table}"><tr><td>duplicated {data.table}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2> Determine Duplicate Rows </h2> <h3>Description</h3> <p><code>duplicated</code> returns a logical vector indicating which rows of a <code>data.table</code> are duplicates of a row with smaller subscripts. </p> <p><code>unique</code> returns a <code>data.table</code> with duplicated rows removed, by columns specified in <code>by</code> argument. When no <code>by</code> then duplicated rows by all columns are removed. </p> <p><code>anyDuplicated</code> returns the <em>index</em> <code>i</code> of the first duplicated entry if there is one, and 0 otherwise. </p> <p><code>uniqueN</code> is equivalent to <code>length(unique(x))</code> when x is an <code>atomic vector</code>, and <code>nrow(unique(x))</code> when x is a <code>data.frame</code> or <code>data.table</code>. The number of unique rows are computed directly without materialising the intermediate unique data.table and is therefore faster and memory efficient. </p> <h3>Usage</h3> <pre> ## S3 method for class 'data.table' duplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...) ## S3 method for class 'data.table' unique(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...) ## S3 method for class 'data.table' anyDuplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...) uniqueN(x, by=if (is.list(x)) seq_along(x) else NULL, na.rm=FALSE) </pre> <h3>Arguments</h3> <table summary="R argblock"> <tr valign="top"><td><code>x</code></td> <td> <p> A data.table. <code>uniqueN</code> accepts atomic vectors and data.frames as well.</p> </td></tr> <tr valign="top"><td><code>...</code></td> <td> <p> Not used at this time. </p> </td></tr> <tr valign="top"><td><code>incomparables</code></td> <td> <p> Not used. Here for S3 method consistency. </p> </td></tr> <tr valign="top"><td><code>fromLast</code></td> <td> <p> logical indicating if duplication should be considered from the reverse side, i.e., the last (or rightmost) of identical elements would correspond to <code>duplicated = FALSE</code>.</p> </td></tr> <tr valign="top"><td><code>by</code></td> <td> <p><code>character</code> or <code>integer</code> vector indicating which combinations of columns from <code>x</code> to use for uniqueness checks. By default all columns are being used. That was changed recently for consistency to data.frame methods. In version <code>< 1.9.8</code> default was <code>key(x)</code>.</p> </td></tr> <tr valign="top"><td><code>na.rm</code></td> <td> <p>Logical (default is <code>FALSE</code>). Should missing values (including <code>NaN</code>) be removed?</p> </td></tr> </table> <h3>Details</h3> <p>Because data.tables are usually sorted by key, tests for duplication are especially quick when only the keyed columns are considered. Unlike <code><a href="../../base/html/unique.html">unique.data.frame</a></code>, <code>paste</code> is not used to ensure equality of floating point data. It is instead accomplished directly and is therefore quite fast. data.table provides <code><a href="setNumericRounding.html">setNumericRounding</a></code> to handle cases where limitations in floating point representation is undesirable. </p> <p><code>v1.9.4</code> introduces <code>anyDuplicated</code> method for data.tables and is similar to base in functionality. It also implements the logical argument <code>fromLast</code> for all three functions, with default value <code>FALSE</code>. </p> <h3>Value</h3> <p><code>duplicated</code> returns a logical vector of length <code>nrow(x)</code> indicating which rows are duplicates. </p> <p><code>unique</code> returns a data table with duplicated rows removed. </p> <p><code>anyDuplicated</code> returns a integer value with the index of first duplicate. If none exists, 0L is returned. </p> <p><code>uniqueN</code> returns the number of unique elements in the vector, <code>data.frame</code> or <code>data.table</code>. </p> <h3>See Also</h3> <p><code><a href="setNumericRounding.html">setNumericRounding</a></code>, <code><a href="data.table.html">data.table</a></code>, <code><a href="duplicated.html">duplicated</a></code>, <code><a href="duplicated.html">unique</a></code>, <code><a href="all.equal.data.table.html">all.equal</a></code>, <code><a href="setops.html">fsetdiff</a></code>, <code><a href="setops.html">funion</a></code>, <code><a href="setops.html">fintersect</a></code>, <code><a href="setops.html">fsetequal</a></code> </p> <h3>Examples</h3> <pre> DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6), key = "A,B") duplicated(DT) unique(DT) duplicated(DT, by="B") unique(DT, by="B") duplicated(DT, by=c("A", "C")) unique(DT, by=c("A", "C")) DT = data.table(a=c(2L,1L,2L), b=c(1L,2L,1L)) # no key unique(DT) # rows 1 and 2 (row 3 is a duplicate of row 1) DT = data.table(a=c(3.142, 4.2, 4.2, 3.142, 1.223, 1.223), b=rep(1,6)) unique(DT) # rows 1,2 and 5 DT = data.table(a=tan(pi*(1/4 + 1:10)), b=rep(1,10)) # example from ?all.equal length(unique(DT$a)) # 10 strictly unique floating point values all.equal(DT$a,rep(1,10)) # TRUE, all within tolerance of 1.0 DT[,which.min(a)] # row 10, the strictly smallest floating point value identical(unique(DT),DT[1]) # TRUE, stable within tolerance identical(unique(DT),DT[10]) # FALSE # fromLast=TRUE DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6), key = "A,B") duplicated(DT, by="B", fromLast=TRUE) unique(DT, by="B", fromLast=TRUE) # anyDuplicated anyDuplicated(DT, by=c("A", "B")) # 3L any(duplicated(DT, by=c("A", "B"))) # TRUE # uniqueN, unique rows on key columns uniqueN(DT, by = key(DT)) # uniqueN, unique rows on all columns uniqueN(DT) # uniqueN while grouped by "A" DT[, .(uN=uniqueN(.SD)), by=A] # uniqueN's na.rm=TRUE x = sample(c(NA, NaN, runif(3)), 10, TRUE) uniqueN(x, na.rm = FALSE) # 5, default uniqueN(x, na.rm=TRUE) # 3 </pre> <hr /><div style="text-align: center;">[Package <em>data.table</em> version 1.14.4 <a href="00Index.html">Index</a>]</div> </body></html>