<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Fit a Smoothing Spline</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
</head><body>

<table width="100%" summary="page for smooth.spline {stats}"><tr><td>smooth.spline {stats}</td><td style="text-align: right;">R Documentation</td></tr></table>

<h2>Fit a Smoothing Spline</h2>

<h3>Description</h3>

Fits a cubic smoothing spline to the supplied data.

<h3>Usage</h3>

<pre>
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, lambda = NULL, cv = FALSE,
 all.knots = FALSE, nknots = .nknots.smspl,
 keep.data = TRUE, df.offset = 0, penalty = 1,
 control.spar = list(), tol = 1e-6 * IQR(x), keep.stuff = FALSE)
</pre>

<h3>Arguments</h3>

<table summary="R argblock">
<tr valign="top"><td><code>x</code></td>
<td>
a vector giving the values of the predictor variable, or a
list or a two-column matrix specifying x and y. 
</td></tr>
<tr valign="top"><td><code>y</code></td>
<td>
responses. If <code>y</code> is missing or <code>NULL</code>, the responses
are assumed to be specified by <code>x</code>, with <code>x</code> the index
vector.
</td></tr>
<tr valign="top"><td><code>w</code></td>
<td>
optional vector of weights of the same length as <code>x</code>;
defaults to all 1.
</td></tr>
<tr valign="top"><td><code>df</code></td>
<td>
the desired equivalent number of degrees of freedom (trace of
the smoother matrix). Must be in (1,nx],
nx the number of unique x values, see below.
</td></tr>
<tr valign="top"><td><code>spar</code></td>
<td>
smoothing parameter, typically (but not necessarily) in
(0,1]. When <code>spar</code> is specified, the coefficient
&lambda; of the integral of the squared second derivative in the
fit (penalized log likelihood) criterion is a monotone function of
<code>spar</code>, see the details below. Alternatively <code>lambda</code> may
be specified instead of the scale free <code>spar</code>=s.
</td></tr>
<tr valign="top"><td><code>lambda</code></td>
<td>
if desired, the internal (design-dependent) smoothing
parameter &lambda; can be specified instead of <code>spar</code>.
This may be desirable for resampling algorithms such as cross
validation or the bootstrap.
</td></tr>
<tr valign="top"><td><code>cv</code></td>
<td>
ordinary leave-one-out (<code>TRUE</code>) or &lsquo;generalized&rsquo;
cross-validation (GCV) when <code>FALSE</code>; is used for smoothing
parameter computation only when both <code>spar</code> and <code>df</code> are
not specified; it is used however to determine <code>cv.crit</code> in the
result. Setting it to <code>NA</code> for speedup skips the evaluation of
leverages and any score.
</td></tr>
<tr valign="top"><td><code>all.knots</code></td>
<td>
if <code>TRUE</code>, all distinct points in <code>x</code> are used
as knots. If <code>FALSE</code> (default), a subset of <code>x[]</code> is used,
specifically <code>x[j]</code> where the <code>nknots</code> indices are evenly
spaced in <code>1:n</code>, see also the next argument <code>nknots</code>.

Alternatively, a strictly increasing <code><a href="../../base/html/numeric.html">numeric</a></code> vector
specifying &ldquo;all the knots&rdquo; to be used; must be rescaled
to [0, 1] already such that it corresponds to the
<code>ans$fit$knot</code> sequence returned, not repeating the boundary
knots.
</td></tr>
<tr valign="top"><td><code>nknots</code></td>
<td>
integer or <code><a href="../../base/html/function.html">function</a></code> giving the number of
knots to use when <code>all.knots = FALSE</code>. If a function (as by
default), the number of knots is <code>nknots(nx)</code>. By default for
nx &gt; 49 this is less than nx, the number
of unique <code>x</code> values, see the Note.
</td></tr>
<tr valign="top"><td><code>keep.data</code></td>
<td>
logical specifying if the input data should be kept
in the result. If <code>TRUE</code> (as per default), fitted values and
residuals are available from the result.
</td></tr>
<tr valign="top"><td><code>df.offset</code></td>
<td>
allows the degrees of freedom to be increased by
<code>df.offset</code> in the GCV criterion.
</td></tr>
<tr valign="top"><td><code>penalty</code></td>
<td>
the coefficient of the penalty for degrees of freedom
in the GCV criterion.
</td></tr>
<tr valign="top"><td><code>control.spar</code></td>
<td>
optional list with named components controlling the
root finding when the smoothing parameter <code>spar</code> is computed,
i.e., missing or <code>NULL</code>, see below.

Note that this is partly experimental and may change
with general spar computation improvements!

<dl>
<dt>low:</dt><dd>lower bound for <code>spar</code>; defaults to -1.5 (used to
implicitly default to 0 in R versions earlier than 1.4).
</dd>
<dt>high:</dt><dd>upper bound for <code>spar</code>; defaults to +1.5.
</dd>
<dt>tol:</dt><dd>the absolute precision (tolerance) used; defaults
to 1e-4 (formerly 1e-3).
</dd>
<dt>eps:</dt><dd>the relative precision used; defaults to 2e-8 (formerly
0.00244).
</dd>
<dt>trace:</dt><dd>logical indicating if iterations should be traced.
</dd>
<dt>maxit:</dt><dd>integer giving the maximal number of iterations;
defaults to 500.
</dd>
</dl>

Note that <code>spar</code> is only searched for in the interval
[low, high].

</td></tr>
<tr valign="top"><td><code>tol</code></td>
<td>
a tolerance for same-ness or uniqueness of the <code>x</code>
values. The values are binned into bins of size <code>tol</code> and
values which fall into the same bin are regarded as the same. Must
be strictly positive (and finite).
</td></tr>
<tr valign="top"><td><code>keep.stuff</code></td>
<td>
an experimental <code><a href="../../base/html/logical.html">logical</a></code> indicating if
the result should keep extras from the internal computations. This is
intended to allow reconstructing the X matrix and more.
</td></tr>
</table>

<h3>Details</h3>

Neither <code>x</code> nor <code>y</code> may contain missing or
infinite values.

The <code>x</code> vector should contain at least four distinct values.
&lsquo;Distinct&rsquo; here is controlled by <code>tol</code>: values which are
regarded as the same are replaced by the first of their values and the
corresponding <code>y</code> and <code>w</code> are pooled accordingly.
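
The pooling of tied <code>x</code> values can be inspected on a fitted object. The
following is a minimal sketch using the built-in <code>cars</code> data (which has
repeated speeds); that <code>yin</code> equals the per-<code>x</code> mean of the
responses is an assumption checked here, not something stated above.

<pre>
## Sketch: how tied x values are pooled (built-in 'cars' data)
fit &lt;- smooth.spline(cars$speed, cars$dist)

length(fit$data$x)   # original observations, ties included
length(fit$x)        # distinct x values after pooling
sum(fit$w)           # pooled weights add up to the number of observations

## Assumption: with unit weights, 'yin' is the mean of y within each tied group
all.equal(as.vector(fit$yin),
          as.vector(tapply(cars$dist, cars$speed, mean)))
</pre>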

Unless <code>lambda</code> has been specified instead of <code>spar</code>,
the computational &lambda; used (as a function of
<code>spar</code>) is
&lambda; = r * 256^(3*spar - 1)
where
r = tr(X' W X) / tr(&Sigma;),
&Sigma; is the matrix given by
&Sigma;[i,j] = Integral B''[i](t) B''[j](t) dt,
X is given by X[i,j] = B[j](x[i]),
W is the diagonal matrix of weights (scaled such that
its trace is n, the original number of observations)
and B[k](.) is the k-th B-spline.

Note that with these definitions, f_i = f(x_i), and the B-spline
basis representation f = X c (i.e., c is
the vector of spline coefficients), the penalized log likelihood is
L = (y - f)' W (y - f) + &lambda; c' &Sigma; c, and hence
c is the solution of the (ridge regression)
(X' W X + &lambda; &Sigma;) c = X' W y.

If <code>spar</code> and <code>lambda</code> are missing or <code>NULL</code>, the value
of <code>df</code> is used to determine the degree of smoothing. If
<code>df</code> is missing as well, leave-one-out cross-validation (ordinary
or &lsquo;generalized&rsquo; as determined by <code>cv</code>) is used to
determine &lambda;.
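
The four ways of fixing the amount of smoothing can be compared side by side.
This is a sketch on the built-in <code>cars</code> data; the object names
(<code>fit.gcv</code>, <code>fit.df</code>, ...) and the values <code>df = 10</code>
and <code>spar = 0.5</code> are arbitrary choices for illustration.

<pre>
x &lt;- cars$speed
y &lt;- cars$dist

fit.gcv  &lt;- smooth.spline(x, y)              # nothing given: GCV determines lambda
fit.df   &lt;- smooth.spline(x, y, df = 10)     # target equivalent degrees of freedom
fit.spar &lt;- smooth.spline(x, y, spar = 0.5)  # fixed scale-free parameter
fit.lam  &lt;- smooth.spline(x, y, lambda = fit.gcv$lambda)  # fixed lambda

## equivalent degrees of freedom reached by each fit
c(gcv = fit.gcv$df, df = fit.df$df, spar = fit.spar$df, lambda = fit.lam$df)
</pre>

Re-using <code>fit.gcv$lambda</code> should reproduce the GCV fit, which is the
kind of refitting that resampling schemes rely on.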

Note that from the above relation,
<code>spar</code> is spar = s0 + 0.0601 * log(&lambda;),
which is intentionally different from the S-PLUS implementation
of <code>smooth.spline</code> (where <code>spar</code> is proportional to
&lambda;). In R's (log &lambda;) scale, it makes more
sense to vary <code>spar</code> linearly.
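
The relation between <code>spar</code> and &lambda; can be checked numerically
from the components of a fit. In this sketch the value <code>spar = 0.6</code> is
arbitrary, and the expression for <code>s0</code> is derived here by solving the
formula above for <code>spar</code>; it is not taken from the implementation.

<pre>
fit &lt;- smooth.spline(cars$speed, cars$dist, spar = 0.6)

## lambda reconstructed from spar and the trace ratio r
lambda.from.spar &lt;- fit$ratio * 256^(3 * fit$spar - 1)
all.equal(lambda.from.spar, fit$lambda)

## inverse relation: spar is linear in log(lambda),
## with slope 1 / (3 * log(256)), which is about 0.0601
s0 &lt;- (1 - log(fit$ratio) / log(256)) / 3
spar.from.lambda &lt;- s0 + log(fit$lambda) / (3 * log(256))
all.equal(spar.from.lambda, fit$spar)
</pre>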

Note however that currently the results may become very unreliable
for <code>spar</code> values smaller than about -1 or -2. The same may
happen for values larger than 2 or so. Don't think of setting
<code>spar</code> or the controls <code>low</code> and <code>high</code> outside such a
safe range, unless you know what you are doing!
Similarly, specifying <code>lambda</code> instead of <code>spar</code> is
delicate, notably as the range of &ldquo;safe&rdquo; values for
<code>lambda</code> is not scale-invariant and hence entirely data dependent.

The &lsquo;generalized&rsquo; cross-validation method GCV will work correctly when
there are duplicated points in <code>x</code>. However, it is ambiguous what
leave-one-out cross-validation means with duplicated points, and the
internal code uses an approximation that involves leaving out groups
of duplicated points. <code>cv = TRUE</code> is best avoided in that case.

<h3>Value</h3>

An object of class <code>"smooth.spline"</code> with components

<table summary="R valueblock">
<tr valign="top"><td><code>x</code></td>
<td>
the distinct <code>x</code> values in increasing order, see
the &lsquo;Details&rsquo; above.
</td></tr>
<tr valign="top"><td><code>y</code></td>
<td>
the fitted values corresponding to <code>x</code>.
</td></tr>
<tr valign="top"><td><code>w</code></td>
<td>
the weights used at the unique values of <code>x</code>.
</td></tr>
<tr valign="top"><td><code>yin</code></td>
<td>
the y values used at the unique <code>x</code> values.
</td></tr>
<tr valign="top"><td><code>tol</code></td>
<td>
the <code>tol</code> argument (whose default depends on <code>x</code>).
</td></tr>
<tr valign="top"><td><code>data</code></td>
<td>
only if <code>keep.data = TRUE</code>: itself a
<code><a href="../../base/html/list.html">list</a></code> with components <code>x</code>, <code>y</code> and <code>w</code>
of the same length. These are the original (x_i,y_i,w_i),
 i = 1, &hellip;, n, values where <code>data$x</code> may have repeated values and
hence be longer than the above <code>x</code> component; see details.

</td></tr>
<tr valign="top"><td><code>lev</code></td>
<td>
(when <code>cv</code> was not <code>NA</code>) leverages, the diagonal
values of the smoother matrix.
</td></tr>
<tr valign="top"><td><code>cv.crit</code></td>
<td>
cross-validation score, &lsquo;generalized&rsquo; or true, depending
on <code>cv</code>. The CV score is often called &ldquo;PRESS&rdquo; (and
labeled on <code><a href="../../base/html/print.html">print</a>()</code>), for &lsquo;PREdiction
Sum of Squares&rsquo;.
</td></tr>
<tr valign="top"><td><code>pen.crit</code></td>
<td>
the penalized criterion, a non-negative number; simply
the (weighted) residual sum of squares (RSS), <code> sum(.$w * residuals(.)^2) </code>.
</td></tr>
<tr valign="top"><td><code>crit</code></td>
<td>
the criterion value minimized in the underlying
<code>.Fortran</code> routine &lsquo;sslvrg&rsquo;. When <code>df</code> has been specified,
the criterion is 3 + (tr(S[lambda]) - df)^2,
where the 3 + is there for numerical (and historical) reasons.
</td></tr>
<tr valign="top"><td><code>df</code></td>
<td>
equivalent degrees of freedom used. Note that (currently)
this value may become quite imprecise when the true <code>df</code> is
between 1 and 2.

</td></tr>
<tr valign="top"><td><code>spar</code></td>
<td>
the value of <code>spar</code> computed or given, unless it has been
given as <code>c(lambda = *)</code>, when it is set to <code>NA</code> here.
</td></tr>
<tr valign="top"><td><code>ratio</code></td>
<td>
(when <code>spar</code> above is not <code>NA</code>), the value
r, the ratio of two matrix traces.
</td></tr>
<tr valign="top"><td><code>lambda</code></td>
<td>
the value of &lambda; corresponding to <code>spar</code>,
see the details above.
</td></tr>
<tr valign="top"><td><code>iparms</code></td>
<td>
named integer(3) vector where <code>..$iparms["iter"]</code>
gives the number of spar computing iterations used.
</td></tr>
<tr valign="top"><td><code>auxMat</code></td>
<td>
experimental; when <code>keep.stuff</code> was true, a
&ldquo;flat&rdquo; numeric vector containing parts of the internal computations.
</td></tr>
<tr valign="top"><td><code>fit</code></td>
<td>
list for use by <code><a href="predict.smooth.spline.html">predict.smooth.spline</a></code>, with
components

<dl>
<dt>knot:</dt><dd>the knot sequence (including the repeated boundary
knots), scaled into [0, 1] (via <code>min</code> and
<code>range</code>).
</dd>
<dt>nk:</dt><dd>number of coefficients or number of &lsquo;proper&rsquo;
knots plus 2.
</dd>
<dt>coef:</dt><dd>coefficients for the spline basis used.
</dd>
<dt>min, range:</dt><dd>numbers giving the corresponding quantities of
<code>x</code>.
</dd>
</dl>

</td></tr>
<tr valign="top"><td><code>call</code></td>
<td>
the matched call.
</td></tr>
</table>
<code>methods(class = "smooth.spline")</code> shows a
<code><a href="influence.measures.html">hatvalues</a>()</code> method based on the <code>lev</code> vector above.
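
Some of the components listed above are tied together by simple identities.
The sketch below checks two of them on the built-in <code>cars</code> data; it
reads the leverages directly from the <code>lev</code> component and assumes the
default <code>cv = FALSE</code>, so that leverages are computed.

<pre>
fit &lt;- smooth.spline(cars$speed, cars$dist)

## equivalent degrees of freedom: trace of the smoother matrix,
## i.e. the sum of the leverages
all.equal(sum(fit$lev), fit$df)

## pen.crit: weighted residual sum of squares on the pooled (unique-x) data
all.equal(fit$pen.crit, sum(fit$w * (fit$yin - fit$y)^2))
</pre>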

<h3>Note</h3>

The number of unique <code>x</code> values, nx, is
determined by the <code>tol</code> argument, equivalently to
<pre>
 nx &lt;- length(x) - sum(duplicated( round((x - mean(x)) / tol) ))</pre>
The default <code>all.knots = FALSE</code> and <code>nknots = .nknots.smspl</code>
entail using only O(nx ^ 0.2)
knots instead of nx for nx &gt; 49. This reduces
time and memory requirements, but no longer drastically since R
version 1.5.1, where the cost is only O(nk) + O(n), with
nk the number of knots.

In this case where not all unique <code>x</code> values are
used as knots, the result is not a smoothing spline in the strict
sense, but very close unless a small smoothing parameter (or large
<code>df</code>) is used.
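
The effect of the knot selection can be seen by comparing the default fit with
an <code>all.knots = TRUE</code> fit. This is a sketch on simulated data; the
sine-plus-noise model, the seed and the sample size of 1000 are assumptions made
only for illustration.

<pre>
set.seed(1)
x &lt;- seq(0, 1, length.out = 1000)
y &lt;- sin(2 * pi * x) + rnorm(1000, sd = 0.3)

## nx computed with the default tol = 1e-6 * IQR(x)
tol &lt;- 1e-6 * IQR(x)
nx  &lt;- length(x) - sum(duplicated(round((x - mean(x)) / tol)))

fit.sub &lt;- smooth.spline(x, y)                    # default: subset of knots
fit.all &lt;- smooth.spline(x, y, all.knots = TRUE)  # one knot per unique x

## fit$nk is the number of coefficients, i.e. 'proper' knots plus 2
c(nx = nx, default.nknots = .nknots.smspl(nx),
  nk.sub = fit.sub$fit$nk, nk.all = fit.all$fit$nk)
</pre>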

<h3>Author(s)</h3>

R implementation by B. D. Ripley and Martin Maechler
(<code>spar/lambda</code>, etc).

<h3>Source</h3>

This function is based on code in the <code>GAMFIT</code> Fortran program by
T. Hastie and R. Tibshirani (originally taken from
<a href="http://lib.stat.cmu.edu/general/gamfit">http://lib.stat.cmu.edu/general/gamfit</a>)
which makes use of spline code by Finbarr O'Sullivan. Its design
parallels the <code>smooth.spline</code> function of Chambers &amp; Hastie (1992).

<h3>References</h3>

Chambers, J. M. and Hastie, T. J. (1992)
Statistical Models in S, Wadsworth &amp; Brooks/Cole.

Green, P. J. and Silverman, B. W. (1994)
Nonparametric Regression and Generalized Linear Models:
A Roughness Penalty Approach. Chapman and Hall.

Hastie, T. J. and Tibshirani, R. J. (1990)
Generalized Additive Models. Chapman and Hall.

<h3>See Also</h3>

<code><a href="predict.smooth.spline.html">predict.smooth.spline</a></code> for evaluating the spline
and its derivatives.
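
As a brief sketch of that use, the fitted curve and its first derivative can be
evaluated at new points; the evaluation points in <code>new.x</code> are
arbitrary values chosen for illustration.

<pre>
fit &lt;- smooth.spline(cars$speed, cars$dist)

new.x &lt;- c(10, 15, 20)
predict(fit, new.x)             # fitted curve at new.x (list with x and y)
predict(fit, new.x, deriv = 1)  # first derivative of the curve at new.x

## predicting at the fitted x values should reproduce the 'y' component
all.equal(as.vector(predict(fit, fit$x)$y), as.vector(fit$y))
</pre>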

<h3>Examples</h3>

<pre>
require(graphics)
plot(dist ~ speed, data = cars, main = "data(cars) &amp; smoothing splines")
cars.spl &lt;- with(cars, smooth.spline(speed, dist))
cars.spl
## This example has duplicate points, so avoid cv = TRUE

lines(cars.spl, col = "blue")
ss10 &lt;- smooth.spline(cars[,"speed"], cars[,"dist"], df = 10)
lines(ss10, lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] =&gt; df =",round(cars.spl$df,1)),
               "s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
       bg = 'bisque')

## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")

## consistency check:
stopifnot(all.equal(cars$dist,
                    fitted(cars.spl) + residuals(cars.spl)))
## The chosen inner knots in original x-scale :
with(cars.spl$fit, min + range * knot[-c(1:3, nk+1 +1:3)]) # == unique(cars$speed)

## Visualize the behavior of  .nknots.smspl()
nKnots &lt;- Vectorize(.nknots.smspl) ; c.. &lt;- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)

n &lt;- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 &lt;- c(50, 200, 800, 3200); c0 &lt;- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)

##-- artificial example
y18 &lt;- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx  &lt;- seq(1, length(y18), len = 201)
(s2   &lt;- smooth.spline(y18)) # GCV
(s02  &lt;- smooth.spline(y18, spar = 0.2))
(s02. &lt;- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)

## Specifying 'lambda' instead of usual spar :
(s2. &lt;- smooth.spline(y18, lambda = s2$lambda, tol = s2$tol))

## The following shows the problematic behavior of 'spar' searching:
(s2 &lt;- smooth.spline(y18, control.spar =
 list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m &lt;- smooth.spline(y18, cv = TRUE, control.spar =
 list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)
</pre>

<hr /><div style="text-align: center;">[Package stats version 3.6.0 <a href="00Index.html">Index</a>]</div>
</body></html>