<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Compute or Extract Silhouette Information from Clustering</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
</head><body>

<table width="100%" summary="page for silhouette {cluster}"><tr><td>silhouette {cluster}</td><td style="text-align: right;">R Documentation</td></tr></table>

<h2>Compute or Extract Silhouette Information from Clustering</h2>

<h3>Description</h3>

Compute silhouette information according to a given clustering in
k clusters.

<h3>Usage</h3>

<pre>
silhouette(x, ...)
## Default S3 method:
silhouette(x, dist, dmatrix, ...)
## S3 method for class 'partition'
silhouette(x, ...)
## S3 method for class 'clara'
silhouette(x, full = FALSE, ...)

sortSilhouette(object, ...)
## S3 method for class 'silhouette'
summary(object, FUN = mean, ...)
## S3 method for class 'silhouette'
plot(x, nmax.lab = 40, max.strlen = 5,
     main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
     col = "gray", do.col.sort = length(col) &gt; 1, border = 0,
     cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)
</pre>

<h3>Arguments</h3>

<table summary="R argblock">
<tr valign="top"><td><code>x</code></td>
<td>
an object of appropriate class; for the <code>default</code>
method an integer vector with k different integer cluster
codes or a list with such an <code>x$clustering</code>
component. Note that silhouette statistics are only defined if
2 &lt;= k &lt;= n-1.
</td></tr>
<tr valign="top"><td><code>dist</code></td>
<td>
a dissimilarity object inheriting from class
<code><a href="../../stats/html/dist.html">dist</a></code> or coercible to one. If not specified,
<code>dmatrix</code> must be.
</td></tr>
<tr valign="top"><td><code>dmatrix</code></td>
<td>
a symmetric dissimilarity matrix (n x n),
specified instead of <code>dist</code>, which can be more efficient.
</td></tr>
<tr valign="top"><td><code>full</code></td>
<td>
logical specifying if a full silhouette should be
computed for a <code><a href="clara.html">clara</a></code> object. Note that this requires
O(n^2) memory, since the full dissimilarity (see
<code><a href="daisy.html">daisy</a></code>) is needed internally.
</td></tr>
<tr valign="top"><td><code>object</code></td>
<td>
an object of class <code>silhouette</code>.
</td></tr>
<tr valign="top"><td><code>...</code></td>
<td>
further arguments passed to and from methods.
</td></tr>
<tr valign="top"><td><code>FUN</code></td>
<td>
function used to summarize silhouette widths.
</td></tr>
<tr valign="top"><td><code>nmax.lab</code></td>
<td>
integer indicating the number of labels that is
considered too large for single-name labeling of the silhouette plot.
</td></tr>
<tr valign="top"><td><code>max.strlen</code></td>
<td>
positive integer giving the length to which
strings are truncated in silhouette plot labeling.
</td></tr>
<tr valign="top"><td><code>main, sub, xlab</code></td>
<td>
arguments to <code><a href="../../graphics/html/title.html">title</a></code>; have a
sensible non-NULL default here.
</td></tr>
<tr valign="top"><td><code>col, border, cex.names</code></td>
<td>
arguments passed to
<code><a href="../../graphics/html/barplot.html">barplot</a>()</code>; note that the default used to be <code>col
 = heat.colors(n), border = par("fg")</code> instead.
<code>col</code> can also be a color vector of length k for
clusterwise coloring, see also <code>do.col.sort</code>.

</td></tr>
<tr valign="top"><td><code>do.col.sort</code></td>
<td>
logical indicating if the colors <code>col</code> should
be sorted &ldquo;along&rdquo; the silhouette; this is useful for casewise or
clusterwise coloring.
</td></tr>
<tr valign="top"><td><code>do.n.k</code></td>
<td>
logical indicating if n and k &ldquo;title text&rdquo;
should be written.
</td></tr>
<tr valign="top"><td><code>do.clus.stat</code></td>
<td>
logical indicating if cluster sizes and averages
should be written to the right of the silhouettes.
</td></tr>
</table>
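
<p>As a minimal sketch of the two ways to supply dissimilarities to the default method (assuming the <code>ruspini</code> data shipped with the package and an illustrative choice of k = 4), the same silhouette widths should result whether a <code>dist</code> object or a full symmetric matrix is passed:</p>

<pre>
library(cluster)
data(ruspini)
cl &lt;- pam(ruspini, 4)$clustering                  # integer cluster codes; k = 4 is illustrative
d  &lt;- dist(ruspini)                               # object of class "dist"
s.dist &lt;- silhouette(cl, dist = d)                # dissimilarities given as 'dist'
s.dmat &lt;- silhouette(cl, dmatrix = as.matrix(d))  # full n x n matrix instead
all.equal(s.dist[, "sil_width"], s.dmat[, "sil_width"])
</pre>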

<h3>Details</h3>

For each observation i, the silhouette width s(i) is
defined as follows: 
Put a(i) = average dissimilarity between i and all other points of the
cluster to which i belongs (if i is the only observation in
its cluster, s(i) := 0 without further calculations).
For all other clusters C, put d(i,C) = average
dissimilarity of i to all observations of C. The smallest of these
d(i,C) is b(i) := min<sub>C</sub> d(i,C),
and can be seen as the dissimilarity between i and its &ldquo;neighbor&rdquo;
cluster, i.e., the nearest one to which it does not belong.
Finally, 

 s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).

<code>silhouette.default()</code> is now based on C code donated by Romain
Francois (the R version being still available as
<code>cluster:::silhouette.default.R</code>).

Observations with a large s(i) (close to 1) are very well
clustered; a small s(i) (around 0) means that the observation
lies between two clusters; and observations with a negative
s(i) are probably placed in the wrong cluster.
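
<p>The definition of s(i) can be checked by hand. The following is only an illustrative sketch, not the package's C implementation: <code>sil.width()</code> is a hypothetical helper written for this example, and the <code>ruspini</code> data with k = 4 are an arbitrary choice.</p>

<pre>
library(cluster)
data(ruspini)
cl &lt;- pam(ruspini, 4)$clustering
D  &lt;- as.matrix(dist(ruspini))

## hypothetical helper: silhouette width of observation i
sil.width &lt;- function(i, cl, D) {
  own &lt;- cl == cl[i]
  if (sum(own) == 1) return(0)                 # singleton cluster: s(i) := 0
  a &lt;- sum(D[i, own]) / (sum(own) - 1)         # D[i, i] is 0, so average over the other members
  b &lt;- min(tapply(D[i, !own], cl[!own], mean)) # smallest average dissimilarity to another cluster
  (b - a) / max(a, b)
}

s.manual &lt;- vapply(seq_along(cl), sil.width, numeric(1), cl = cl, D = D)
s.pkg    &lt;- silhouette(cl, dmatrix = D)[, "sil_width"]
all.equal(s.manual, unname(s.pkg))
</pre>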

<h3>Value</h3>

<code>silhouette()</code> returns an object, <code>sil</code>, of class
<code>silhouette</code> which is an n x 3 matrix with
attributes. For each observation i, <code>sil[i,]</code> contains the
cluster to which i belongs as well as the neighbor cluster of i (the
cluster, not containing i, for which the average dissimilarity between its
observations and i is minimal), and the silhouette width s(i) of
the observation. The <code><a href="../../base/html/colnames.html">colnames</a></code> correspondingly are
<code>c("cluster", "neighbor", "sil_width")</code>.
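
<p>Because the result is a matrix with named columns, it can be indexed like one. A short sketch, assuming the <code>ruspini</code> data, an illustrative k = 3, and an arbitrary cut-off of 0.25 for "weak" observations:</p>

<pre>
library(cluster)
data(ruspini)
sil &lt;- silhouette(pam(ruspini, 3))   # k = 3 chosen only for illustration
colnames(sil)
## rows whose width falls below the (arbitrary) cut-off
weak &lt;- sil[sil[, "sil_width"] &lt; 0.25, , drop = FALSE]
weak
## how often each cluster has each other cluster as its neighbor
table(cluster = sil[, "cluster"], neighbor = sil[, "neighbor"])
</pre>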

<code>summary(sil)</code> returns an object of class
<code>summary.silhouette</code>, a list with components

<dl>
<dt><code>si.summary</code>:</dt><dd>numerical <code><a href="../../base/html/summary.html">summary</a></code> of the
individual silhouette widths s(i).
</dd>
<dt><code>clus.avg.widths</code>:</dt><dd>numeric (rank 1) array of clusterwise
means of silhouette widths where <code>mean = FUN</code> is used.
</dd>
<dt><code>avg.width</code>:</dt><dd>the total mean <code>FUN(s)</code> where
<code>s</code> are the individual silhouette widths.
</dd>
<dt><code>clus.sizes</code>:</dt><dd><code><a href="../../base/html/table.html">table</a></code> of the k cluster sizes.
</dd>
<dt><code>call</code>:</dt><dd>if available, the <code><a href="../../base/html/call.html">call</a></code> creating <code>sil</code>.
</dd>
<dt><code>Ordered</code>:</dt><dd>logical identical to <code>attr(sil, "Ordered")</code>,
see below.
</dd>
</dl>
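
<p>The components listed above can be extracted with <code>$</code>. A brief sketch, again assuming the <code>ruspini</code> data and k = 4 purely for illustration; the second call swaps in <code>median</code> for <code>FUN</code>:</p>

<pre>
library(cluster)
data(ruspini)
sil  &lt;- silhouette(pam(ruspini, 4))
ssum &lt;- summary(sil)                 # FUN = mean by default
ssum$avg.width                       # overall FUN of the widths
ssum$clus.avg.widths                 # one value per cluster
ssum$clus.sizes                      # table of cluster sizes
med  &lt;- summary(sil, FUN = median)   # same summary, using the median instead
med$avg.width
</pre>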

<code>sortSilhouette(sil)</code> orders the rows of <code>sil</code> as in the
silhouette plot, by cluster (increasingly) and decreasing silhouette
width s(i).
 
<code>attr(sil, "Ordered")</code> is a logical indicating if <code>sil</code> is
ordered as by <code>sortSilhouette()</code>. In that case,
<code>rownames(sil)</code> will contain case labels or numbers, and 
<code>attr(sil, "iOrd")</code> the ordering index vector.
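
<p>A small sketch of the ordering attributes, assuming a silhouette computed with the default method (whose rows are expected to be in the original data order) on the <code>ruspini</code> data with an illustrative k = 4:</p>

<pre>
library(cluster)
data(ruspini)
cl   &lt;- pam(ruspini, 4)$clustering
sil  &lt;- silhouette(cl, dist(ruspini))  # default method
ssil &lt;- sortSilhouette(sil)            # by cluster, then decreasing width
attr(sil,  "Ordered")
attr(ssil, "Ordered")
iOrd &lt;- attr(ssil, "iOrd")             # ordering index into the unsorted rows
all.equal(unname(ssil[, "sil_width"]), unname(sil[iOrd, "sil_width"]))
</pre>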

While <code>silhouette()</code> is intrinsic to the
<code><a href="partition.object.html">partition</a></code> clusterings, and hence has a (trivial) method
for these, it is straightforward to get silhouettes for hierarchical
clusterings from <code>silhouette.default()</code>, with the result of
<code><a href="../../stats/html/cutree.html">cutree</a>()</code> and a distance as input.
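
<p>For instance, a sketch using <code>hclust()</code> from the stats package; the average linkage and k = 4 are assumptions made only for this illustration:</p>

<pre>
library(cluster)
data(ruspini)
d  &lt;- dist(ruspini)
hc &lt;- hclust(d, method = "average")       # any hierarchical clustering would do
sil.hc &lt;- silhouette(cutree(hc, k = 4), d)  # cluster codes from cutree() plus the distances
summary(sil.hc)$avg.width
</pre>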

By default, for <code><a href="clara.html">clara</a>()</code> partitions, the silhouette is
just for the best random subset used. Use <code>full = TRUE</code>
to compute (and later possibly plot) the full silhouette.

<h3>References</h3>

Rousseeuw, P.J. (1987)
Silhouettes: A graphical aid to the interpretation and validation of
cluster analysis. J. Comput. Appl. Math., 20, 53&ndash;65.

Chapter 2 of Kaufman and Rousseeuw (1990), see
the references in <code><a href="plot.agnes.html">plot.agnes</a></code>.

<h3>See Also</h3>

<code><a href="partition.object.html">partition.object</a></code>, <code><a href="plot.partition.html">plot.partition</a></code>.

<h3>Examples</h3>

<pre>
data(ruspini)
pr4 &lt;- pam(ruspini, 4)
str(si &lt;- silhouette(pr4))
(ssi &lt;- summary(si))
plot(si) # silhouette plot
plot(si, col = c("red", "green", "blue", "purple"))# with cluster-wise coloring

si2 &lt;- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax.lab = 80, cex.names = 0.6)

op &lt;- par(mfrow= c(3,2), oma= c(0,0, 3, 0),
          mgp= c(1.6,.8,0), mar= .1+c(4,2,2,2))
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman &amp; Rousseeuw, p.101",
      outer = TRUE, font = par("font.main"), cex = par("cex.main")); frame()

## the same with cluster-wise colours:
c6 &lt;- c("tomato", "forest green", "dark blue", "purple2", "goldenrod4", "gray20")
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE,
        col = c6[1:k])
par(op)

## clara(): standard silhouette is just for the best random subset
data(xclara)
set.seed(7)
str(xc1k &lt;- xclara[ sample(nrow(xclara), size = 1000) ,]) # rownames == indices
cl3 &lt;- clara(xc1k, 3)
plot(silhouette(cl3))# only of the "best" subset of 46
## The full silhouette: internally needs the full dist object (about 4 MB for
## these 1000 rows; about 36 MB for all 3000 rows of xclara):
sf &lt;- silhouette(cl3, full = TRUE) ## this is the same as
s.full &lt;- silhouette(cl3$clustering, daisy(xc1k))
stopifnot(all.equal(sf, s.full, check.attributes = FALSE, tolerance = 0))
## color dependent on original "3 groups of each 1000":
plot(sf, col = 2+ as.integer(names(cl3$clustering) ) %/% 1000,
     main ="plot(silhouette(clara(.), full = TRUE))")

## Silhouette for a hierarchical clustering:
ar &lt;- agnes(ruspini)
si3 &lt;- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                  daisy(ruspini))
plot(si3, nmax.lab = 80, cex.names = 0.5)
## 2 groups: Agnes() wasn't too good:
si4 &lt;- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax.lab = 80, cex.names = 0.5)
</pre>

<hr /><div style="text-align: center;">[Package cluster version 2.0.8 <a href="00Index.html">Index</a>]</div>
</body></html>