EVOLUTION-MANAGER
Edit File: concordance.Rnw
\documentclass{article}[11pt] \usepackage{Sweave} \usepackage{amsmath} \addtolength{\textwidth}{1in} \addtolength{\oddsidemargin}{-.5in} \setlength{\evensidemargin}{\oddsidemargin} %\VignetteIndexEntry{Concordance} \SweaveOpts{keep.source=TRUE, fig=FALSE} % Ross Ihaka suggestions \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} % I had been putting figures in the figures/ directory, but the standard % R build script does not copy it and then R CMD check fails \SweaveOpts{prefix.string=compete,width=6,height=4} \newcommand{\myfig}[1]{\includegraphics[height=!, width=\textwidth] {compete-#1.pdf}} \setkeys{Gin}{width=\textwidth} <<echo=FALSE>>= options(continue=" ", width=60) options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1)))) pdf.options(pointsize=10) #text in graph about the same as regular text options(contrasts=c("contr.treatment", "contr.poly")) #ensure default #require("survival") #library(survival) library(survival) library(splines) @ \title{Concordance} \author{Terry Therneau, Elizabeth Atkinson} \newcommand{\code}[1]{\texttt{#1}} \begin{document} \maketitle \section{The concordance statistic} Use of the concordance statistic for Cox models was popularized by Harrell \cite{Harrell96}, and it is now the most used measure of goodness-of-fit in survival models. One advantage of the statistic is that it is well defined not just for survival models, but also for logistic and ordinary linear regression. In general let $y_i$ and $x_i$ be observed and predicted data values, in the most common case $x = \hat\eta$, the linear predictor from a fitted model. The concordance is defined $P(x_i > x_j | y_i > y_j)$, the probability that the prediction $x$ goes in the same direction as the actual data $y$. A pair of obervations $i$, $j$ is considered concordant if the prediction and the data go in the same direction, i.e., $(y_i > y_j, x_i > x_j)$ or $(y_i < y_j, x_i < x_j)$. The concordance is the fraction of concordant pairs. For a Cox model remember that the predicted survival $\hat y$ is longer if the risk score $X\beta$ is lower, so we have to flip the definitions of concordant and discordant. For now, ignore this detail and use the usual definition for exposition. One wrinkle is what to do with ties in either $y$ or $x$. Such pairs can be ignored in the count (treated as incomparable), treated as discordant, or given a score of 1/2. Let $C, D, T_x, T_y$ and $T_{xy}$ be a count of the pairs that are concordant, discordant, and tied on the predictor $x$ (but not $y$), tied on $y$ (but not $x$), and tied on both. Then \begin{align} \tau_a &= \frac{C-D}{C+D + T_x + T_y + T_{xy}} \label{tau.a} \\ \tau_b &= \frac{C-D}{\sqrt{(C+D+T_x)(C + D + T_y)}} \label{tau.b} \\ \gamma &= \frac{C-D}{C+D} \label{gamma} \\ d &= \frac{C-D}{C + D + T_x} \label{somer} \end{align} \begin{itemize} \item Kendall's tau-a \eqref{tau.a} is the most conservative; essentially treating ties as failures \item The Goodman-Kruskal $\gamma$ statistic \eqref{gamma} ignores ties in either $y$ or $x$. \item Somers' $d$ \eqref{somer} treats ties in $y$ as incomparable; pairs that are tied in $x$ (but not $y$) score as 1/2. The AUC measure commonly used in logistic regression is equivalent to Somers' $d$. \end{itemize} All three of the above range from -1 to 1. The concordance is $(d+1)/2$. The concordance has a natural interpretation as an experiment: present pairs of subjects one at a time to the physician, statistical model, or some other oracle, and count the number of correct predictions. Pairs that have the same outcome are not put forward for scoring (that would be unfair to the contestant); and if the oracle cannot decide it makes a random choice. This leads to $C + T_x/2$ correct selections out of $C + D + T_x$ choices, which is easily seen to be equal to $(d+1)/2$. Kendall's tau-b has a denominator of the same type, but treats $x$ and $y$ symmetrically. This hypothetical experiment gives a baseline insight into the concordance. A value of 1/2 corresponds to using a random guess for each subject. Values of $<.55$ are not very impressive, since the ordering for some pairs of subjects will be obvious, and someone with almost no medical knowledge could do nearly that well by marking those pairs and using a coin flip for the rest. Values of less than 1/2 are possible --- some stock market analysts come to mind. For survival data any pairs which cannot be ranked with certainty are also considered incomparable. For instance is $y_i$ is censored at time 10 and $y_j$ is an event (or censor) at time 20. Subject $i$ may or may not survive longer than subject $j$, and so it is not possible to tell if the rule has ranked them correctly or not. Note that if $y_i$ is censored at time 10 and $y_j$ is an event at time 10 then $y_i > y_j$. For stratified models, observations that are in different strata are also considered to be incomparable. \section{Examples} The \code{concordance} function available in the survival package can be used to estimate concordance for various types of models, including logistic and linear regression. <<examples1>>= # logistic regression using Fisher's iris data fit1 <- glm(Species=="versicolor" ~ ., family=binomial, data=iris) concordance(fit1) # this gives the AUC # linear regression fit2 <- lm(karno ~ age + trt, data=veteran) concordance(fit2) # 2*concordance-1 = somers' d @ The primary the focus of this vignette will be on survival data. <<examples2>>= # parametric survival regression fit3 <- survreg(Surv(time, status) ~ karno + age + trt, data=veteran) concordance(fit3) # 3 Cox models fit4 <- coxph(Surv(time, status) ~ karno + age + trt, data=veteran) fit5 <- update(fit4, . ~ . + celltype) fit6 <- update(fit5, . ~ . + prior) ctest <- concordance(fit4, fit5, fit6) ctest @ As shown in the last example, the concordance for multiple fits can be obtained from a single call. The variance-covariance matrix for all three concordance values is available using \code{vcov(ctest)}; this is used in an example below to formally test equality of two concordance values. The above also shows that addition of another variable to a fitted model can decrease the concordance. Concordance can be visualized using its contribution at each event time as shown below. The vertial axis on the graph is the rank of each of the risk score for each subject who has an event, among all those with at risk. Each rank ranges from 0 to 1, where 1= the subject who died had the highest risk score at that time and 0 = lowest. In a highly predictive Cox model we expect the subjects with the highest risks to die soonest. A log transformation on the x-axis was used to visually spread out the points. The concordance is a weigthed average of these $y$ values. After about 50 days, the baseline data has little predictive ability and the average ranks are close to 1/2. The second panel shows two possible weights, based on the number of subjects available at a given time, where $n(t)$, the default choice, is the number of comparable pairs at each time point. Alternate weights such as $S/G$ are discussed in section \ref{weighted}. For both choices the weights decrease precipitously over time and the final average is based largely on the left hand portion of the plot. <<rplot, fig=TRUE>>= par(mfrow=c(1,2)) c3 <- concordance(fit3, ranks=TRUE) c4 <- concordance(fit3, ranks=TRUE, timewt="S/G") # For technical reasons the code returns ranks on Somers' d scale, # from -1 to 1. Transform them to 0-1 d.to.c <- function(x) (x+1)/2 plot(d.to.c(rank) ~ time, data=c3$ranks, log='x', ylab="Rank") lfit <- with(c3$ranks, lowess(log(time), d.to.c(rank))) lines(exp(lfit$x), lfit$y, col=2, lwd=2) abline(.5,0, lty=2) matplot(c3$ranks$time, cbind(c3$ranks$timewt, c4$ranks$timewt), type="l", col=c("black", "red"), lwd=2, xlab="Time", ylab="Weight", log="x") legend("topright", legend=c("n(t)","S(t)/G(t)"), lty=1:2, col=c("black","red"), bty="n") @ \section{Connection to the proportional hazards model} Watson and Therneau \cite{Watson15} show that the numerator of Somers' $d$ for a response $y$ and predictor $x$ can be re-written as \begin{equation} C-D = 2 \sum \delta_i n(t_i) \left[ r_i(t_i) - 1/2 \right] \label{cscore} \end{equation} where $C$ and $D$ are the total number of concordance and discordant pairs, $n(t)$ is the number of subjects still at risk at time $t$, and $r_i(t)$ is the rank of $x_i$ among all those still at risk at time $t$, where ranks are defined such that $0 \le r \le 1$. It turns out that equation \eqref{cscore} is exactly the score statistic for a Cox model with a single time-dependent covariate $n(t) r(t)$. One immediate consequence of this connection is a straightforward definition of concordance for a risk score containing time dependent covariates. Since the Cox model score statistic is well defined for time dependent terms this justifies calculation of the values $C$, $D$, etc in the same way: at each event time the current risk score of the subject who failed is compared to the current scores of all those still at risk. \section{Variance} The variance of the statistic is estimated in two ways. The first is to use the variance of the equivalent Cox model score statistic. As pointed out by Watson, this estimate is both correct and efficient under $H_0: d=0$, a null concordance of 1/2, and so it forms a valid test of $H_0$. However, when the concordance is larger than 1/2 the estimate systematically overestimates the true variance. An alternative that remains unbiased is the infinitesimal jackknife (IJ) variance \begin{align*} V &= \sum_{i=1}^n w_iU_i^2 \\ U_i &= \frac{\partial c}{\partial w_i} \end{align*} The concordance routine calculates an influence matrix $U$ with one row per subject and columns that contain the derivitives of the 5 individual counts: concordant, discordant, tied on x, tied on y, and tied on xy pairs. From this it is straightforward to derive the influence of each subject on the concordance, or on any other of the other possible association measures \eqref{tau.a} -- \eqref{somer} mentioned earlier. The IJ variance is printed by default but the PH variance is also returned; the earlier \code{survConcordance} function only computed the PH variance. The PH variance is a simple sum of the variances for each of the $r_i$ terms shown in the prior plot of the ranks. Under $H_0$ each of these terms will be approximately uniformly distributed between -1 and 1, which implies that the variance of the individual terms will be about 1/3, and indeed this is true as seen in the plot below. The ranks $r$ are discrete rather than continuous, but this does not have any appreciable affect until the number at risk drops below about 5 or there are a substantial number of ties in the predictor. <<varest, fig=TRUE>>= plot(variance ~ time, c3$ranks, ylim=c(0, .34), log="x") abline(h=1/3, col=2) @ This suggests that the most \emph{statisticially efficient} estimate of the average rank would use weights of 1 rather than $n(t)$ when adding up the $r_i$ terms, which leads to something very like an ordinary Cox model or logrank test. However, it is no longer the concordance. One aside to remember, however, is the p-value from a Cox model will almost always be smaller than that for the concordance statistic for that same model, which is one side effect of this difference in efficiency. \section{Multiple concordances} One useful property of using a jackknife variance estimate is that the variance of the difference in concordance between two separately fitted models is also easily obtained. If $c_a$ and $c_b$ are the two concordance statistics and $U_{ia}$ and $U_{ib}$ the corresponding influence values, the influence vector for $c_a-c_b$ is $U_a - U_b$. (If subject $i$ increases $c_a$ by .03 and $c_b$ by .01, then he/she raises the difference between them by .02.) It is not necessary that the models be nested. However, it is crucial that they be computed on the exact same set of observations. Here is a comparison of concordance values from previous models. <<test>>= ctest <- concordance(fit4, fit5, fit6) ctest # compare concordance values of fit4 and fit5 contr <- c(-1, 1, 0) dtest <- contr %*% coef(ctest) dvar <- contr %*% vcov(ctest) %*% contr c(contrast=dtest, sd=sqrt(dvar), z=dtest/sqrt(dvar)) @ \subsection{Missing data} The lung data set has several variables with missing data and shows that care is necessary. <<lungcompare>>= colSums(is.na(lung)) # count missing values/variable # First attempt fit6 <- coxph(Surv(time, status) ~ age + ph.ecog, data=lung) fit7 <- coxph(Surv(time, status) ~ meal.cal + pat.karno, data=lung) #tryCatch(concordance(fit6,fit7)) # produces an error # Second attempt lung2 <- na.omit(subset(lung, select= -c(inst, wt.loss))) fit6b <- coxph(Surv(time, status) ~ age + ph.ecog, data=lung2) fit7b <- coxph(Surv(time, status) ~ meal.cal + pat.karno, data=lung2) concordance(fit6b,fit7b) @ \section{Weighted concordance} \label{weighted} An interesting consequence of the equivalence between the concordance and the Cox model is the question of alternate weightings of the risk scores. Let $0 < r_i(t) < 1$ be the time dependent rank then the values of $2 (r_i(t) -1/2)$ range from -1 to 1 and their weighted sum is both the Cox score statistic and the numerator of Somers' $d$. If the original Cox model has a single 0/1 treatment covariate then $d$ is exactly the Gehan-Wilcoxon statistic; replacing these with weights of 1 instead of $n(t)$ will yield the log-rank statistic. There has been a long debate about the "best" weight for survival tests, and we can apply some of the same historical arguments to the concordance as well. We will point out four of interest: \begin{itemize} \item Peto and Peto \cite{Peto72} point out that $n(t) \approx n(0)S(t-)G(t-)$, where $S$ is the survival distribution and $G$ the censoring distribution. They argue that $S(t)$ would be a better weight since $G$ may have features that are irrelevant to the question being tested. For a particular data set Prentice \cite{Prentice79} later showed that these concerns were indeed justified, and almost all software now uses the Peto-Wilcoxon variant. \item Schemper et al \cite{Schemper09} argue for a weight of $S(t)/G(t)$ in the Cox model. When proportional hazards does not hold the coefficient from the Cox model is an ``average'' hazard ratio, and they show that using $S/G$ leads to a value that remains interpretable in terms of an underlying population model. The same argument may apply more strongly for the concordance, since the target is an ``assumption free'' assessment of association. \item Uno et al \cite{Uno11} recommend the use of $n/G^2$ as a weight based on a consistency argument. If we assume that the concordance value that would be obtained after full followup of all subjects (no censoring) is the ``right'' one, and proportional hazards does not hold, then the standard concordance will not consistently estimate this target quantity when there is censoring. It is ``biased''. (The Peto and Peto argument might suggest $S/G$ as an equivalent but more stable choice of weight.) \item Korn and Simon \cite{Korn90} point out the importance of restricting the range of comparison. Though a risk model can be used for long-range prediction, in actual patient practice this will not be the case; the model should be evaluated over the time range in which it will actually be used. For example, a physician is quite unlikely to look up my lab tests from 5 years ago, and then compute a 6 year survival probability forward from that point, in order to predict my outcome one year from today. \end{itemize} The \code{timewt} option allows you to modify weights for the concordance. The options are: \code{n, S, S/G, n/G, n/G2, I}, the last giving equal weight to each event time. The figure below shows the first four weights for the colon cancer data set. This data is from a clinical trial with 3 years of enrollment followed by 5 years of follow. Since there is almost no one lost to follow-up in the first 5 years all four weights are nearly identical over that time. From 5 to 8 years $S(t)$ continues its steady decline, $n(t)$ plummets due to administrative censoring, and $S/G$ explodes. Even with these changes in weights, the concordance values are all very similar. <<tmwt, fig=TRUE>>= colonfit <- coxph(Surv(time, status) ~ rx + nodes + extent, data=colon, subset=(etype==2)) # death only cord1 <- concordance(colonfit, timewt="n", ranks=TRUE) cord2 <- concordance(colonfit, timewt="S", ranks=TRUE) cord3 <- concordance(colonfit, timewt="S/G", ranks=TRUE) cord4 <- concordance(colonfit, timewt="n/G2", ranks=TRUE) c(n= coef(cord1), S=coef(cord2), "S/G"= coef(cord3), "n/G2"= coef(cord4)) matplot(cord1$ranks$time/365.25, cbind(cord1$ranks$timewt, cord2$ranks$timewt, cord3$ranks$timewt, cord4$ranks$timewt), type= "l", ylim= c(0, 6000), xlab="Years since enrollment", ylab="Weight") legend("left", c("n(t)", "S(t)", "S(t)/G(t)", "n(t)/G^2(t)"), lwd=2, col=1:4, lty=1:4, bty="n") @ A second example where we would expect weights to have a larger influence uses the \code{mgus2} data set. The underlying study has an unusually long median follow-up time of over 15 years, giving ample time for non-proportionality to manifest. Both creatinine and hemoglobin levels are associated with higher mortality, but of course a 10 year old marker is not nearly as predictive. The R data set has time in months, which has been converted to years for plotting purposes. <<mgus1, fig=TRUE>>= fit6 <- coxph(Surv(futime/12, death) ~ hgb, data=mgus2) zp <- cox.zph(fit6, transform="identity") plot(zp, df=4, resid=FALSE, ylim=c(-.4, .1), xlab="Years") abline(0,0, lty=3, col=2) @ The predictive effect of hemoglobin drops to less than half after 5 years, with a maximum study follow-up of 35 years. Now calculate the weighted concordance using four approaches. <<mgus2, fig=TRUE>>= c6a <- concordance(fit6, timewt="n", ranks=TRUE) c6b <- concordance(fit6, timewt="S", ranks=TRUE) c6c <- concordance(fit6, timewt="S/G", ranks=TRUE) c6d <- concordance(fit6, timewt="n/G2", ranks=TRUE) c(n= coef(c6a), S=coef(c6b), "S/G"= coef(c6c), "n/G2"= coef(c6d)) par(mfrow=c(1,2)) rfit <- lm(rank ~ ns(time,3), data=c6a$ranks) termplot(rfit, se=TRUE, col.se=1, col.term=1, xlab="Years", ylab="Smoothed rank") matplot(c6a$ranks$time, cbind(c6a$ranks$timewt, c6b$ranks$timewt, c6c$ranks$timewt, c6d$ranks$timewt), type= "l", xlab="Years since enrollment", ylab="Weight") # distribution of death times quantile(c6a$ranks$time) @ Surprisingly, the four weightings still yield almost identical concordance values. A clue to this lies in the quantile result. The concordance is a weighted mean of the ranks, with one term per death time in the sum. The quantile shows that 1/2 the deaths occur before 5 years and 3/4 of them before 9; the larger weights simply play only a small role in the overll sum. So, which weight should we use? As shown in the examples above, it may not matter that much. The fact that we have not found examples where the effect is large does not mean there are no such data sets, however. For survival data, one important issue that has not been sorted out is how to extend the weighting arguments to data sets that are subject to delayed entry, e.g., when using age as the time scale instead of time since enrollment. Since this usage is moderately frequent, and also because it is not possible for the \code{coxph} routine to reliably tell the difference between such left censoring and simple time-dependent covariates or strata, the default action of the routine is to use the safe choice of $n(t)$ and not to impose a range restriction. Further issues to consider: \begin{enumerate} \item Consider setting a time ($y$) restriction using the \code{ymin} or \code{ymax} options, based on careful thought about the proper range of interest. This part of the choice may often have the largest practical effect. \item Safety. If using the usual Gehan-Wilcoxon weights of $n(t)$, the Peto-Wilcoxon variant $S(t)$ would appear advantageous, particularly if there is differential censoring for some subjects. \item Current data versus future invariance. On the one hand assessment of a model should make use of the all available data, ``make the most with what you have'', but on the other hand we would prefer the estimated concordance to stay stable as a study's follow-up matures, which argues for $S/G$ or $n/G^2$ weights. If $G(t)$ approaches 0, however, these weights can become unstable and so may need to be combined with a time restriction. \item Equality vs. efficiency. On one hand we would like to treat each data pair equally, but in our quest for ever sharper p-values we want to be efficient. The first argues for $n(t)$ as the weight and the second for using equal weights, since the variances of each term are the same. This is exactly the argument between the Gehan-Wilcoxon and the log-rank tests. \item For uncensored data $n$, $S$ and $S/G$ weights are all identical. \end{enumerate} Our current opinion is that since the point of the concordance is to evaluate the model in a more non-parametric way, so a log-rank type of focus on ideal p-values is misplaced. This suggests using either $S$ or $S/G$ as weights. Both give more prominence to the later time points as compared to the default $n(t)$ choice, but if time limits have been thought through carefully the difference will often be minor. We most definitely do not agree that the ``proper'' target for estimation is always the $c$ statistic one would obtain with infinite follow-up and no censoring, which is the unstated assumption underlying Uno's assertion that the ordinary concordance is biased and must be repaired by use of $n/G^2$. Proportional hazards never is true over the long term, simply because it is almost impossible to predict events that are a decade or more away and thus the points in the $r(t)$ plot above will eventually tend to 0. The starting point should always be to think through exactly \emph{what} one wants to estimate. As stated by Yogi Berra ``if you don't know where you are going, you might not get there.'' \section{Details} This section documents a few details - most readers can skip it. The usual convention for survival data is to assume that censored values come after deaths, even if they are recorded on the same day. This corresponds to the common case that a subject who is censored on day 200, say, was actually seen on that day. That is, their survival is strictly greater than 200. As a consequence, censoring weights $G$ actually use $G(t-)$ in the code: if 10 subjects are censored at day 100, and these are the first censorings in the study, then an event on day 100 should not be given a larger weight. (Both the Uno and Schemper papers ignore this detail.) When using weights of $S(t)$ the program actually uses a weight of $nS(t)$ where $n$ is the number of subjects. The reason is that for a stratified model the weighted number of concordant, discordant and tied pairs is calculated separately for each stratum, and then added together. If one stratum were much smaller or larger than the others we want to preserved this fact in the sum. Consider a time point $t$ at which there were 3 events out of 40 subjects at risk. For the the ordinary concordance the time weight at this point will be the number of \emph{comparable pairs} for each of the 3 events, i.e., 37 for each. The rank of an event will be 1 if its predictor is smaller than all 37 comparitors and -1 if it is larger than all 37. When using a weight of $S(t)$ the weight will be $nS(t)$ at that point, for $S/G$ it will be $nS(t)/G(t-)$. The Cox model computation takes a slightly different view, in that all 40 subjects are considered to be at risk for each of the 3 events. The upshot of this is that the time weight changes to 40 rather than 37, while the rank becomes smaller by a factor of 37/40. The weighted sum remains the same, i.e., the Cox score statistic is equal to the numerator of the concordance statistic. For other weights $S(t-)$ replaces $S(t)$ in the Cox calculation, the individual ranks again shrink slightly but the weighted sum stays the same. \bibliography{refer} \bibliographystyle{plain} \end{document}