EVOLUTION-MANAGER
Edit File: survival.Rnw
\documentclass{report}[11pt] \usepackage{Sweave} \usepackage{amsmath} \addtolength{\textwidth}{1in} \addtolength{\oddsidemargin}{-.5in} \setlength{\evensidemargin}{\oddsidemargin} %\VignetteIndexEntry{The survival package} \SweaveOpts{keep.source=TRUE, fig=FALSE} % Ross Ihaka suggestions \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} % I had been putting figures in the figures/ directory, but the standard % R build script does not copy it and then R CMD check fails \SweaveOpts{prefix.string=compete,width=6,height=4} \newcommand{\myfig}[1]{\includegraphics[height=!, width=\textwidth] {compete-#1.pdf}} \newcommand{\code}[1]{\texttt{#1}} \setkeys{Gin}{width=\textwidth} <<echo=FALSE>>= options(continue=" ", width=60) options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1)))) pdf.options(pointsize=10) #text in graph about the same as regular text options(contrasts=c("contr.treatment", "contr.poly")) #ensure default require("survival") @ \title{The survival package} \author{Terry Therneau} \begin{document} \maketitle \chapter{Introduction} Work on the survival package began in 1985 in connection with the analysis of medical research data, without any realization at the time that the work would become a package. Eventually the software was placed on the Statlib repository hosted by Carnegie Mellon University. Multiple version were released in this fashion but I don't have a list of dates --- version 2 was the first to make use of the \code{print} method that was introduced in `New S' in 1988, which places that release somewhere in 1989. The library was eventually incorportated directly in S-Plus, and from there it became a standard part of R. Looking back, I think that one of the primary reasons for the package's success is that all of the functions have been written to solve real analysis questions that arose from real data sets; theoretical issues were explored when necessary but they never played a leading role. As a statistician in a major medical center, the central focus of my department is to advance medicine; statistics is a tool to that end. This also highlights one of the deficiencies of the package: if a particular analysis question has not yet arisen in one of my studies then the survival package will also have nothing to say on the topic. Luckily there are many other R packages that build on or extend the survival package, and anyone working in the field, the author included, will use more packages than this one. I certainly never foresaw that the library would become as popular as it has. This vignette is an introduction to version 3.x of the survival package. I think of versions 1.x as the S-Plus era and 2.1 -- 2.44 as maturation of the package in R. Version 3 had 4 major goals. \begin{itemize} \item Make mulit-state curves and models as easy to use as an ordinary Kaplan-Meier and Cox model. \item Deeper support for absolute risk models. \item More consitent support of robust variance estimates. \item Clean up various naming inconsistencies that have arisen over time. \end{itemize} With over 600 dependent packages in 2019, not counting Bioconductor, other guiding lights of the change are \begin{itemize} \item We can't do everything (so don't try). \item Allow other packages to build on this one. That means clear documentation of all of the objects that are produced, the use of simple objects (no S4 classes) that are easy to manipulate, and setting up many of the routines as a pair. For example \code{concordance} and \code{concordance.fit}; the former is the user front end and the latter does the actual work. Other package authors might want to access the lower level interface. \item Don't bugger it up! \end{itemize} This meant preserving the current argument names as much as possible. Appendix \ref{sect:changes} summarizes changes that were made which are not backwards compatable. The two other major changes are to collapse the many vignettes into this single large one, and the parallel creation of an actual book. This latter recognizes that the package needs more than a vignette. With the book's appearance this vignette can be more brief, essentially leaving out a lot of the theory. Version 3 will not appear all at once, however; it will take some time to get all of the documentation sorted out in the way that we like. \section{Changes in version 3} Version 3.0 of the package was released in conjunction with a book. Writing the book, and in particular the examples, revealed some shortcomings in the design. In particular, there were some common concepts which had appeared piecemeal in more than one function, but not using the same keywords. Two particular areas are survival curves and multiple observations per subject. Survival and cumulative hazard curves are genereated by the \code{survfit} function, either from raw data (survfit.formula), or a fitted Cox or parametric survival model (survfit.coxph, survfit.survreg). Two choices that appear are \begin{enumerate} \item If there are tied event times, to estimate the hazard using a straightforward increment of (number of events)/(number at risk), or make a correction for the ties. The simpler method is known variously as the Nelson, Aalen, Breslow, and Tsiatis estimate, along with hypenenated forms combining 2 or 3 of them. The same basic formula has been re-created in many contexts. One of the simpler corrections for ties is known as the Fleming-Harrington approximation when used with raw data, and the Efron when used in a Cox model. \item The survival curve $S(t)$ can be estimated directly or as the exponential of the cumulative hazard estimate. The first of these is known as the Kaplan-Meier, cumulative incidence (CI), Aalen-Johansen, and Kalbfleisch-Prentice estimate, depending on context, the second as a Fleming-Harringtion, Breslow, or Efron estimate, again depending on context. \end{enumerate} With respect to the two above, subtypes of the \code{survfit} routine have had either a \code{type} or \code{method} argument over the years which tried to capture both of these at the same time, and consequently have had a bewildering number of options, for example ``fleming-harrington'' in \code{survfit.formula} stood for the simple cumulative hazard estimate plus the exponential survival estimate, ``fh2'' specified the tie-corrected cumulative hazard plus exponential survival, while \code{survfit.coxph} used ``breslow'' and ``efron'' for the same two combinations. The updated routines now have separate \code{stype} and \code{ctype} arguments. For the first 1= direct and 2=exponent of the cumulative hazard and for the second 1=simple and 2= corrected. The Cox model is a special case in two ways: 1. the the way in which ties are treated in the likelihood should match the way they are treated in creating the hazard and 2. the direct estimate of survival can be very difficult to compute. The survival package's default is to use the \code{ctype} option which matches the ties option of the \code{coxph} call along with an exponential estimate of survival. This \code{ctype} choice preserves some useful properties of the martingale residuals. A second issue is multiple observations per subject, and how those impact the computations. This leads to 3 common arguments of \begin{itemize} \item id: an identifier in each row of the data, which allows the routines to identify multiple rows for a subject \item cluster: identify correlated rows, which should be combined when creating the robust variance \item robust: TRUE or FALSE, to compute a robust variance. \end{itemize} These arguments have been inconsistent in the past, partly because of the sequential appearance of multiple use cases. The package started with only the simplest data form: one observation per subject, one endpoint. To this has been added \begingroup \renewcommand{\theenumi}{\alph{enumi}} \begin{enumerate} \item Multiple observations per subject \item Multiple endpoints per subject \item Multiple types of endpoints \end{enumerate} \endgroup Case (a) arises as a way to code time-dependent covariates, and in this case an \code{id} statement is not needed, and in fact you will get the same estimates and standard errors with or without it. (There will be a change in the counts of subjects who leave or enter an interval, since an observation pair (0, 10), (10, 20) for the same subject will not count as an exit (censor) at 10 plus another entry at 10.) If (b) is true then the robust variance is called for, one will want to have either a \code{cluster} argument or the \code{robust=TRUE} argument. In the coxph routine a \code{cluster(group)} term in the model statement can be used instead of the cluster argument, but this is no longer the preferred form. When (b) and (c) are true then the \code{id} statement is required in order to obtain a correct \emph{estimate} of the result. This is also the case for (c) alone when subjects do not all start in the same state. For competing risks data --- multiple endpoints, everyone starts in the same state, only one transition per subject --- the \code{id} statement is not necessary nor (I think) is a robust variance. When there is an \code{id} statement but no \code{cluster} or \code{robust} directive, then the programs will use (b) as a litmus test to decide between model based or robust variance, if possible. (There are edge cases where only one of the two has been implemented). If there is a \code{cluster} argument then \code{robust=TRUE} is assumed. If only a \code{robust=TRUE} argument is given then the code treats each line of data as independent. \appendix \chapter{Changes from version 2.44 to 3.0} Not all of these may be completed by 3.0, but this is the roadmap. \section{Survfit} The survfit object is changing. The primary change has to do with the starting time. When the package was first written in the late 1980s I made the decision to \emph{not} include the initial point of the survival curve (time=0, S=1) as a part of the \code{time} and \code{survival} parts of the object; they could get tacked on by the plot function. With the addition of delayed start (\code{start.time} option) and then more importantly multi-state models, the starting point is not always a simple (0,1) pair, which led to the addition of new components to the objects to hold the additional information, and increasing if-then-else logic in the downstream routines that process the curves. In v3 the first point is now bundled in as part of the time, surv, pstate, std.err, and etc components. Two functions survfit23 and survfit32 convert between the version 2 \code{survfit} object and new \code{survfit3} forms. This has allowed us make any changes incrementally. Other changes are \begin{itemize} \item Common arguments of id, cluster, and influence \item The routines now produce both the estimated survival and the estimated cumulative hazard, along with their errors \item Some code paths produced std(S) and some std(log(S)), the object now contains a \code{log.se} flag telling which. (Before, downstream routines just ``had to know''). \item an explicit ``v3'' flag in the object \end{itemize} \section{Coxph} The multi-state objects include a \code{states} vector, which is a simple list of the state names. The \code{cmap} component is an integer matrix with one row for each term in the model and one column for each transition. Each element indexes a position in the coefficient vector and variance matrix. \begin{itemize} \item Column labels are of the form 1:2, which denotes a transition from \code{state[1]} to \code{state[2]}. \item If a particluar term in the data, ``age'' say, was not part of the model for a particular transition then a 0 will appear in that position of \code{cmap}. \item If two transitions share a common coefficient, both those element of \code{cmap} will point to the same location. \item The first row of \code{cmap}, labeled ``stratum'', is numbered separately and partitions the transtions into disjoint strata. \end{itemize} \end{document}