<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Fast overlap joins</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for foverlaps {data.table}"><tr><td>foverlaps {data.table}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2>Fast overlap joins</h2> <h3>Description</h3> <p>A <em>fast</em> binary-search based <em>overlap join</em> of two <code>data.table</code>s. It is inspired by the <code>findOverlaps</code> function from the Bioconductor package <code>IRanges</code> (see the link under <code>See Also</code>). </p> <p>Usually, <code>x</code> is a very large <code>data.table</code> with small interval ranges, and <code>y</code> is a much smaller <em>keyed</em> <code>data.table</code> with relatively larger interval spans. For a use case in <code>genomics</code>, see the Examples section. </p> <p>NOTE: This function is still under development: it is stable, but some features are yet to be implemented, and some arguments or the function name itself could change. </p> <h3>Usage</h3> <pre>
foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
    by.y = key(y), maxgap = 0L, minoverlap = 1L,
    type = c("any", "within", "start", "end", "equal"),
    mult = c("all", "first", "last"),
    nomatch = getOption("datatable.nomatch", NA),
    which = FALSE, verbose = getOption("datatable.verbose"))
</pre> <h3>Arguments</h3> <table summary="R argblock"> <tr valign="top"><td><code>x, y</code></td> <td> <p><code>data.table</code>s. <code>y</code> must be keyed; <code>x</code> need not be. See examples. </p> </td></tr> <tr valign="top"><td><code>by.x, by.y</code></td> <td> <p>A vector of column names (or numbers) on which to compute the overlap join. 
The last two columns in both <code>by.x</code> and <code>by.y</code> must be the <code>start</code> and <code>end</code> interval columns of <code>x</code> and <code>y</code> respectively, and each <code>start</code> value must be &lt;= the corresponding <code>end</code> value. <code>by.x</code> defaults to <code>key(x)</code> if <code>x</code> is keyed, and to <code>key(y)</code> otherwise. <code>by.y</code> defaults to <code>key(y)</code>. </p> </td></tr> <tr valign="top"><td><code>maxgap</code></td> <td> <p>A non-negative integer (&gt;= 0). Default is 0 (no gap). For intervals <code>[a,b]</code> and <code>[c,d]</code>, where <code>a&lt;=b</code> and <code>c&lt;=d</code>, the two intervals do not overlap when <code>c &gt; b</code> or <code>d &lt; a</code>. If the gap between them is <code>&lt;= maxgap</code>, they are nevertheless treated as overlapping. Note: This is not yet implemented.</p> </td></tr> <tr valign="top"><td><code>minoverlap</code></td> <td> <p> A positive integer (&gt; 0). Default is 1. For intervals <code>[a,b]</code> and <code>[c,d]</code>, where <code>a&lt;=b</code> and <code>c&lt;=d</code>, the two intervals overlap when <code>c&lt;=b</code> and <code>d&gt;=a</code>. They are treated as overlapping only if the length of the overlap is <code>&gt;= minoverlap</code>. Note: This is not yet implemented.</p> </td></tr> <tr valign="top"><td><code>type</code></td> <td> <p> Default value is <code>any</code>. Allowed values are <code>any</code>, <code>within</code>, <code>start</code>, <code>end</code> and <code>equal</code>. </p> <p>The types are identical in functionality to those of the <code>findOverlaps</code> function in the Bioconductor package <code>IRanges</code>. Let <code>[a,b]</code> and <code>[c,d]</code> be intervals in <code>x</code> and <code>y</code> with <code>a&lt;=b</code> and <code>c&lt;=d</code>. 
For <code>type="start"</code>, the intervals overlap iff <code>a == c</code>. For <code>type="end"</code>, the intervals overlap iff <code>b == d</code>. For <code>type="within"</code>, the intervals overlap iff <code>a&gt;=c and b&lt;=d</code>. For <code>type="equal"</code>, the intervals overlap iff <code>a==c and b==d</code>. For <code>type="any"</code>, the intervals overlap iff <code>c&lt;=b and d&gt;=a</code>. In addition to these requirements, they must also satisfy the <code>minoverlap</code> argument as explained above. </p> <p>NB: The <code>maxgap</code> argument, when &gt; 0, is to be interpreted according to the type of the overlap. This will be updated once <code>maxgap</code> is implemented.</p> </td></tr> <tr valign="top"><td><code>mult</code></td> <td> <p> When multiple rows in <code>y</code> match a row in <code>x</code>, <code>mult</code> controls which of them are returned: <code>"all"</code> (default), <code>"first"</code> or <code>"last"</code>.</p> </td></tr> <tr valign="top"><td><code>nomatch</code></td> <td> <p> When a row (with interval, say, <code>[a,b]</code>) in <code>x</code> has no match in <code>y</code>, <code>nomatch=NA</code> (default) means <code>NA</code> is returned for <code>y</code>'s non-<code>by.y</code> columns for that row of <code>x</code>. <code>nomatch=NULL</code> (or <code>0</code> for backward compatibility) means no rows will be returned for that row of <code>x</code>. Use <code>options(datatable.nomatch=NULL)</code> to change the default value (used when <code>nomatch</code> is not supplied).</p> </td></tr> <tr valign="top"><td><code>which</code></td> <td> <p> When <code>TRUE</code> and <code>mult="all"</code>, returns a two-column <code>data.table</code> with the first column holding <code>x</code>'s row numbers and the second holding the matching row numbers of <code>y</code>. 
When <code>nomatch=NA</code>, rows of <code>x</code> with no match return <code>NA</code> for <code>y</code>, and when <code>nomatch=NULL</code>, those rows are skipped. If <code>mult="first" or "last"</code>, a vector of length equal to the number of rows in <code>x</code> is returned, with no-match entries filled with <code>NA</code> or <code>0</code> according to the <code>nomatch</code> argument. Default is <code>FALSE</code>, which returns a join with the rows in <code>y</code>.</p> </td></tr> <tr valign="top"><td><code>verbose</code></td> <td> <p><code>TRUE</code> turns on status and information messages to the console. Turn this on by default using <code>options(datatable.verbose=TRUE)</code>. The quantity and types of verbosity may be expanded in the future.</p> </td></tr> </table> <h3>Details</h3> <p>Very briefly, <code>foverlaps()</code> collapses the two interval columns of <code>y</code> into a single column of <em>unique</em> values to generate a <code>lookup</code> table, and then performs the join according to the type of <code>overlap</code>, using the already available <code>binary search</code> feature of <code>data.table</code>. The time (and space) required to generate the <code>lookup</code> is therefore proportional to the number of unique values present in the interval columns of <code>y</code> when combined together. </p> <p>Overlap joins take advantage of the fact that <code>y</code> is sorted to speed up finding overlaps. Therefore <code>y</code> has to be keyed (see <code>?setkey</code>) prior to running <code>foverlaps()</code>. A key on <code>x</code> is not necessary, although it <em>might</em> speed things up further. The columns in the <code>by.x</code> argument should correspond to the columns specified in <code>by.y</code>. The last two columns should be the <em>interval</em> columns in both <code>by.x</code> and <code>by.y</code>. 
The first interval column in <code>by.x</code> should always be &lt;= the second interval column in <code>by.x</code>, and likewise for <code>by.y</code>. The <code><a href="../../base/html/mode.html">storage.mode</a></code> of the interval columns must be either <code>double</code> or <code>integer</code>. It therefore works with the <code>bit64::integer64</code> type as well, since that type is stored as <code>double</code>. </p> <p>The <code>lookup</code> generation step could be quite time consuming if the number of unique values in <code>y</code> is very large (e.g. on the order of tens of millions). Constructing the lookup using run-length encoding (RLE) might improve this; it is a pending feature request. However, most scenarios will not have too many unique values for <code>y</code>. </p> <h3>Value</h3> <p>A new <code>data.table</code> obtained by joining over the interval columns (along with any additional identifier columns) specified in <code>by.x</code> and <code>by.y</code>. </p> <p>NB: When <code>which=TRUE</code>: <code>a)</code> <code>mult="first" or "last"</code> returns a <code>vector</code> of matching row numbers in <code>y</code>, and <code>b)</code> <code>mult="all"</code> returns a <code>data.table</code> with two columns, the first containing row numbers of <code>x</code> and the second the corresponding row numbers of <code>y</code>. </p> <p><code>nomatch=NA</code> or <code>NULL</code> (or <code>0</code>) also determines whether non-matching rows are returned, as explained above. 
</p> <h3>See Also</h3> <p><code><a href="data.table.html">data.table</a></code>, <a href="https://www.bioconductor.org/packages/release/bioc/html/IRanges.html">https://www.bioconductor.org/packages/release/bioc/html/IRanges.html</a>, <code><a href="setNumericRounding.html">setNumericRounding</a></code> </p> <h3>Examples</h3> <pre>
require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
foverlaps(x, y, type="any") ## return overlap join
foverlaps(x, y, type="any", mult="first") ## returns only first match
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'

## with extra identifiers (ex: in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
               start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
               end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
foverlaps(x, y, type="any")
foverlaps(x, y, type="any", nomatch=NULL)
foverlaps(x, y, type="within", which=TRUE)
foverlaps(x, y, type="within")
foverlaps(x, y, type="start")

## x and y have different column names - specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
               start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
               end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"),
          type="any", which=TRUE)
</pre> <hr /><div style="text-align: center;">[Package <em>data.table</em> version 1.14.4 <a href="00Index.html">Index</a>]</div> </body></html>