<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Agglomerative Nesting (Hierarchical Clustering)</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <link rel="stylesheet" type="text/css" href="R.css" /> </head><body> <table width="100%" summary="page for agnes {cluster}"><tr><td>agnes {cluster}</td><td style="text-align: right;">R Documentation</td></tr></table> <h2>Agglomerative Nesting (Hierarchical Clustering)</h2> <h3>Description</h3> <p>Computes agglomerative hierarchical clustering of the dataset. </p> <h3>Usage</h3> <pre> agnes(x, diss = inherits(x, "dist"), metric = "euclidean", stand = FALSE, method = "average", par.method, keep.diss = n < 100, keep.data = !diss, trace.lev = 0) </pre> <h3>Arguments</h3> <table summary="R argblock"> <tr valign="top"><td><code>x</code></td> <td> <p>data matrix or data frame, or dissimilarity matrix, depending on the value of the <code>diss</code> argument. </p> <p>In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed. </p> <p>In case of a dissimilarity matrix, <code>x</code> is typically the output of <code><a href="daisy.html">daisy</a></code> or <code><a href="../../stats/html/dist.html">dist</a></code>. Also a vector with length n*(n-1)/2 is allowed (where n is the number of observations), and will be interpreted in the same way as the output of the above-mentioned functions. Missing values (NAs) are not allowed. </p> </td></tr> <tr valign="top"><td><code>diss</code></td> <td> <p>logical flag: if TRUE (default for <code>dist</code> or <code>dissimilarity</code> objects), then <code>x</code> is assumed to be a dissimilarity matrix. If FALSE, then <code>x</code> is treated as a matrix of observations by variables. 
</p> </td></tr> <tr valign="top"><td><code>metric</code></td> <td> <p>character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are <code>"euclidean"</code> and <code>"manhattan"</code>. Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. If <code>x</code> is already a dissimilarity matrix, then this argument will be ignored. </p> </td></tr> <tr valign="top"><td><code>stand</code></td> <td> <p>logical flag: if TRUE, then the measurements in <code>x</code> are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. If <code>x</code> is already a dissimilarity matrix, then this argument will be ignored. </p> </td></tr> <tr valign="top"><td><code>method</code></td> <td> <p>character string defining the clustering method. The seven methods implemented are <code>"average"</code> ([unweighted pair-]group [arithMetic] average method, aka ‘UPGMA’), <code>"single"</code> (single linkage), <code>"complete"</code> (complete linkage), <code>"ward"</code> (Ward's method), <code>"weighted"</code> (weighted average linkage, aka ‘WPGMA’), its generalization <code>"flexible"</code> which uses (a constant version of) the Lance-Williams formula and the <code>par.method</code> argument, and <code>"gaverage"</code>, a generalized <code>"average"</code> aka “flexible UPGMA” method also using the Lance-Williams formula and <code>par.method</code>. </p> <p>The default is <code>"average"</code>. </p> </td></tr> <tr valign="top"><td><code>par.method</code></td> <td> <p>If <code>method</code> is <code>"flexible"</code> or <code>"gaverage"</code>, a numeric vector of length 1, 3, or 4 (with a default for <code>"gaverage"</code>); see the Details section. 
</p> </td></tr> <tr valign="top"><td><code>keep.diss, keep.data</code></td> <td> <p>logicals indicating if the dissimilarities and/or input data <code>x</code> should be kept in the result. Setting these to <code>FALSE</code> can give much smaller results and hence even save memory allocation <em>time</em>.</p> </td></tr> <tr valign="top"><td><code>trace.lev</code></td> <td> <p>integer specifying a trace level for printing diagnostics during the algorithm. Default <code>0</code> does not print anything; higher values print increasingly more.</p> </td></tr> </table> <h3>Details</h3> <p><code>agnes</code> is fully described in chapter 5 of Kaufman and Rousseeuw (1990). Compared to other agglomerative clustering methods such as <code>hclust</code>, <code>agnes</code> has the following features: (a) it yields the agglomerative coefficient (see <code><a href="agnes.object.html">agnes.object</a></code>) which measures the amount of clustering structure found; and (b) apart from the usual tree it also provides the banner, a novel graphical display (see <code><a href="plot.agnes.html">plot.agnes</a></code>). </p> <p>The <code>agnes</code>-algorithm constructs a hierarchy of clusterings.<br /> At first, each observation is a small cluster by itself. Clusters are merged until only one large cluster remains which contains all the observations. At each stage the two <em>nearest</em> clusters are combined to form one larger cluster. </p> <p>For <code>method="average"</code>, the distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster. <br /> In <code>method="single"</code>, we use the smallest dissimilarity between a point in the first cluster and a point in the second cluster (nearest neighbor method). <br /> When <code>method="complete"</code>, we use the largest dissimilarity between a point in the first cluster and a point in the second cluster (furthest neighbor method). 
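</p> <p>As a minimal sketch of the three definitions above, the base R code below computes the average, single and complete linkage distances between two small clusters by hand. The clusters <code>A</code> and <code>B</code> and the name <code>link</code> are made-up illustrations, not part of the package, and this is not meant to show how <code>agnes</code> computes these distances internally. </p> <pre>
## Two hypothetical clusters of points on the real line (illustrative data).
A <- c(1, 2)
B <- c(5, 7, 9)

## All pairwise dissimilarities between a point of A and a point of B.
dAB <- abs(outer(A, B, "-"))

## Distance between the two clusters under each linkage rule.
link <- c(average  = mean(dAB),  # mean of all pairwise dissimilarities
          single   = min(dAB),   # nearest neighbor
          complete = max(dAB))   # furthest neighbor
link
</pre> <p>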
</p> <p>The <code>method = "flexible"</code> allows (and requires) more details: The Lance-Williams formula specifies how dissimilarities are computed when clusters are agglomerated (equation (32) in K&R(1990), p.237). If clusters <i>C_1</i> and <i>C_2</i> are agglomerated into a new cluster, the dissimilarity between their union and another cluster <i>Q</i> is given by </p> <p style="text-align: center;"><i> D(C_1 ∪ C_2, Q) = α_1 * D(C_1, Q) + α_2 * D(C_2, Q) + β * D(C_1,C_2) + γ * |D(C_1, Q) - D(C_2, Q)|, </i></p> <p>where the four coefficients <i>(α_1, α_2, β, γ)</i> are specified by the vector <code>par.method</code>, either directly as a vector of length 4, or (more conveniently) if <code>par.method</code> is of length 1, say <i>= α</i>, <code>par.method</code> is extended to give the “Flexible Strategy” (K&R(1990), p.236 f) with Lance-Williams coefficients <i>(α_1 = α_2 = α, β = 1 - 2α, γ=0)</i>.<br /> Also, if <code>length(par.method) == 3</code>, <i>γ = 0</i> is set. </p> <p><b>Care</b> and expertise are probably needed when using <code>method = "flexible"</code>, particularly when <code>par.method</code> has length greater than one. Since <span class="pkg">cluster</span> version 2.0, choices leading to invalid <code>merge</code> structures now signal an error (from the C code already). The <em>weighted average</em> (<code>method="weighted"</code>) is the same as <code>method="flexible", par.method = 0.5</code>. Further, <code>method= "single"</code> is equivalent to <code>method="flexible", par.method = c(.5,.5,0,-.5)</code>, and <code>method="complete"</code> is equivalent to <code>method="flexible", par.method = c(.5,.5,0,+.5)</code>. </p> <p>The <code>method = "gaverage"</code> is a generalization of <code>"average"</code>, aka “flexible UPGMA” method, and is (a generalization of the approach) detailed in Belbin et al. (1992). 
Like <code>"flexible"</code>, it uses the Lance-Williams formula above for dissimilarity updating, but with <i>α_1</i> and <i>α_2</i> not constant, but <em>proportional</em> to the <em>sizes</em> <i>n_1</i> and <i>n_2</i> of the clusters <i>C_1</i> and <i>C_2</i> respectively, i.e., </p> <p style="text-align: center;"><i> α_j = α'_j * n_j/(n_1 + n_2),</i></p> <p>where <i>α'_1</i>, <i>α'_2</i> are determined from <code>par.method</code>, either directly as <i>(α_1, α_2, β, γ)</i> or <i>(α_1, α_2, β)</i> with <i>γ = 0</i>, or (less flexibly, but more conveniently) as follows: </p> <p>Belbin et al. proposed “flexible beta”, i.e. the user would only specify <i>β</i> (as <code>par.method</code>), sensibly in </p> <p style="text-align: center;"><i>-1 ≤ β < 1,</i></p> <p>and <i>β</i> determines <i>α'_1</i> and <i>α'_2</i> as </p> <p style="text-align: center;"><i>α'_j = 1 - β,</i></p> <p> and <i>γ = 0</i>. </p> <p>This <i>β</i> may be specified by <code>par.method</code> (as length 1 vector), and if <code>par.method</code> is not specified, a default value of -0.1 is used, as Belbin et al. recommend taking a <i>β</i> value around -0.1 as a general agglomerative hierarchical clustering strategy. </p> <p>Note that <code>method = "gaverage", par.method = 0</code> (or <code>par.method = c(1,1,0,0)</code>) is equivalent to the <code>agnes()</code> default method <code>"average"</code>. </p> <h3>Value</h3> <p>an object of class <code>"agnes"</code> (which extends <code>"twins"</code>) representing the clustering. See <code><a href="agnes.object.html">agnes.object</a></code> for details, and methods applicable. </p> <h3>BACKGROUND</h3> <p>Cluster analysis divides a dataset into groups (clusters) of observations that are similar to each other. 
</p> <dl> <dt>Hierarchical methods</dt><dd><p>like <code>agnes</code>, <code><a href="diana.html">diana</a></code>, and <code><a href="mona.html">mona</a></code> construct a hierarchy of clusterings, with the number of clusters ranging from one to the number of observations.</p> </dd> <dt>Partitioning methods</dt><dd><p>like <code><a href="pam.html">pam</a></code>, <code><a href="clara.html">clara</a></code>, and <code><a href="fanny.html">fanny</a></code> require that the number of clusters be given by the user.</p> </dd> </dl> <h3>Author(s)</h3> <p>Method <code>"gaverage"</code> has been contributed by Pierre Roudier, Landcare Research, New Zealand. </p> <h3>References</h3> <p>Kaufman, L. and Rousseeuw, P.J. (1990). (=: “K&R(1990)”) <em>Finding Groups in Data: An Introduction to Cluster Analysis</em>. Wiley, New York. </p> <p>Anja Struyf, Mia Hubert and Peter J. Rousseeuw (1996) Clustering in an Object-Oriented Environment. <em>Journal of Statistical Software</em> <b>1</b>. doi: <a href="http://doi.org/10.18637/jss.v001.i04">10.18637/jss.v001.i04</a> </p> <p>Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997). Integrating Robust Clustering Techniques in S-PLUS, <em>Computational Statistics and Data Analysis</em>, <b>26</b>, 17–37. </p> <p>Lance, G.N., and W.T. Williams (1966). A General Theory of Classificatory Sorting Strategies, I. Hierarchical Systems. <em>Computer J.</em> <b>9</b>, 373–380. </p> <p>Belbin, L., Faith, D.P. and Milligan, G.W. (1992). A Comparison of Two Approaches to Beta-Flexible Clustering. <em>Multivariate Behavioral Research</em>, <b>27</b>, 417–433. 
</p> <h3>See Also</h3> <p><code><a href="agnes.object.html">agnes.object</a></code>, <code><a href="daisy.html">daisy</a></code>, <code><a href="diana.html">diana</a></code>, <code><a href="../../stats/html/dist.html">dist</a></code>, <code><a href="../../stats/html/hclust.html">hclust</a></code>, <code><a href="plot.agnes.html">plot.agnes</a></code>, <code><a href="twins.object.html">twins.object</a></code>. </p> <h3>Examples</h3> <pre>
data(votes.repub)
agn1 <- agnes(votes.repub, metric = "manhattan", stand = TRUE)
agn1
plot(agn1)

op <- par(mfrow=c(2,2))
agn2 <- agnes(daisy(votes.repub), diss = TRUE, method = "complete")
plot(agn2)
## alpha = 0.625 ==> beta = -1/4 is "recommended" by some
agnS <- agnes(votes.repub, method = "flexible", par.meth = 0.625)
plot(agnS)
par(op)

## "show" equivalence of three "flexible" special cases
d.vr <- daisy(votes.repub)
a.wgt  <- agnes(d.vr, method = "weighted")
a.sing <- agnes(d.vr, method = "single")
a.comp <- agnes(d.vr, method = "complete")
iC <- -(6:7) # not using 'call' and 'method' for comparisons
stopifnot(
  all.equal(a.wgt [iC], agnes(d.vr, method="flexible", par.method = 0.5)[iC]) ,
  all.equal(a.sing[iC], agnes(d.vr, method="flex", par.method= c(.5,.5,0, -.5))[iC]),
  all.equal(a.comp[iC], agnes(d.vr, method="flex", par.method= c(.5,.5,0, +.5))[iC]))

## Exploring the dendrogram structure
(d2 <- as.dendrogram(agn2)) # two main branches
d2[[1]] # the first branch
d2[[2]] # the 2nd one { 8 + 42 = 50 }
d2[[1]][[1]]# first sub-branch of branch 1 .. and shorter form
identical(d2[[c(1,1)]],
          d2[[1]][[1]])
## a "textual picture" of the dendrogram :
str(d2)

data(agriculture)

## Plot similar to Figure 7 in ref
## Not run: plot(agnes(agriculture), ask = TRUE)

data(animals)
aa.a  <- agnes(animals) # default method = "average"
aa.ga <- agnes(animals, method = "gaverage")
op <- par(mfcol=1:2, mgp=c(1.5, 0.6, 0), mar=c(.1+ c(4,3,2,1)),
          cex.main=0.8)
plot(aa.a,  which.plot = 2)
plot(aa.ga, which.plot = 2)
par(op)

## Show how "gaverage" is a "generalized average":
aa.ga.0 <- agnes(animals, method = "gaverage", par.method = 0)
stopifnot(all.equal(aa.ga.0[iC], aa.a[iC]))
</pre> <hr /><div style="text-align: center;">[Package <em>cluster</em> version 2.0.8 <a href="00Index.html">Index</a>]</div> </body></html>