Time-to-Event Data Analysis









This page briefly describes a number of questions to consider when analyzing time-to-event data, and includes an annotated list of resources for further information.


What is special about time-to-event (TTE) data?

Time-to-event (TTE) data are unique in that the outcome of interest is not only whether an event occurred, but also when it occurred. Conventional methods of logistic and linear regression are not suited to including both the event and the time aspect as the outcome in the model. Traditional regression methods are also not equipped to handle censoring, a special type of missing data that occurs in time-to-event analyses when subjects do not experience the event of interest during the follow-up period. In the presence of censoring, the true time to event is underestimated. Specific techniques for TTE data, as discussed below, have been developed to use the partial information on each subject with censored data and to provide unbiased survival estimates. These techniques incorporate data from multiple points in time across subjects and can be used to directly calculate rates, time ratios, and hazard ratios.
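As an illustrative sketch (names and values are hypothetical), TTE data are typically encoded as a follow-up duration plus an event indicator, so that censored subjects can contribute their partial follow-up:

```python
# Each subject contributes a follow-up duration and an event indicator
# (1 = event observed, 0 = censored). All values here are made up.
subjects = [
    {"id": 1, "time": 5.0, "event": 1},   # event observed at t = 5
    {"id": 2, "time": 8.0, "event": 0},   # censored at t = 8 (true time > 8)
    {"id": 3, "time": 2.5, "event": 1},
    {"id": 4, "time": 10.0, "event": 0},  # still event-free at end of study
]

# Naively averaging all times treats censored times as event times and
# underestimates the true mean time to event:
naive_mean = sum(s["time"] for s in subjects) / len(subjects)

# Counting only observed events would discard the censored subjects'
# partial information entirely; survival methods use both pieces.
observed_events = sum(s["event"] for s in subjects)
total_person_time = sum(s["time"] for s in subjects)
```

Survival methods combine the event indicator with the accumulated person-time rather than dropping or distorting censored records.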

What are important methodological considerations about time-to-event data?

There are four main methodological considerations in the analysis of time-to-event or survival data. It is important to have a clear definition of the target event, the time origin, and the time scale, and to describe how participants will exit the study. Once these are clearly defined, the analysis becomes more straightforward. Typically there is a single target event, but there are extensions of survival analysis that allow for multiple or repeated events.

What is the time origin?

The time origin is the point at which follow-up time begins. TTE data can employ a variety of time origins that are largely determined by study design, each with associated advantages and disadvantages. Examples include baseline time or baseline age. Time origins can also be determined by a defining characteristic, such as the start of exposure or diagnosis; this is often a natural choice if the outcome is related to that characteristic. Other examples include birth and calendar year. In cohort studies, the time scale is most often time on study.

Is there any option for the timescale other than study time?

Age is another commonly used time scale, where the baseline age is the time origin and individuals exit at their event or censoring age. Models with age as the time scale can be adjusted for calendar effects. Some authors recommend using age rather than time on study as the time scale, as it can provide less biased estimates.

What is censoring?

One of the challenges specific to survival analysis is that typically only some of the subjects have experienced the event by the end of the study, so survival times are unknown for a subset of the study group. This phenomenon is called censoring and can arise in the following ways: the study participant has not yet experienced the relevant outcome, such as relapse or death, by the close of the study; the study participant is lost to follow-up during the study period; or the study participant experiences a different event that makes further follow-up impossible. Such censored times underestimate the true but unknown time to event. Most analytical approaches assume that censoring is random or non-informative.

There are three main types of censoring: right, left, and interval. If events occur beyond the end of the study, the data are right-censored. Data are left-censored if the event is known to have occurred before the start of observation, but the exact event time is unknown. Data are interval-censored if the event is known to have occurred between two observation times, but the exact event time is unknown. Most survival analysis methods are designed for right-censored observations, but methods for interval- and left-censored data are available.

What's the question of interest?

The choice of analysis tool should be guided by the research question. For TTE data, the research question can take several forms, which influence which survival function is most relevant. Three different types of research questions that may be of interest for TTE data are:

  1. What proportion of individuals will remain free of the event after a certain time?

  2. What proportion of individuals will have experienced the event after a certain time?

  3. What is the risk of the event at a particular point in time, among those who have survived until that point?

Each of these questions corresponds to a different type of function used in survival analysis:

  1. Survival function, S(t): the probability that an individual will survive beyond time t [Pr(T > t)]

  2. Cumulative distribution function, F(t), also called the cumulative incidence function, R(t): the probability that an individual will have a survival time less than or equal to t [Pr(T ≤ t)]

  3. Hazard function, h(t): the instantaneous potential of experiencing an event at time t, conditional on having survived to that time

  4. Cumulative hazard function, H(t): the integral of the hazard function from time 0 to time t, which corresponds to the area under the curve h(t) between time 0 and time t

If one of these functions is known, the other functions can be calculated using the following formulas:

S(t) = 1 - F(t): the survival function and the cumulative distribution function sum to 1

h(t) = f(t) / S(t): the instantaneous hazard equals the unconditional probability of experiencing the event at time t (the density, f(t)), scaled by the fraction alive at time t

H(t) = -log[S(t)]: the cumulative hazard equals the negative log of the survival function

S(t) = e^-H(t): the survival function equals the exponentiated negative cumulative hazard


These conversions are used frequently in survival analysis methods, as discussed below. In general, an increase in h(t), the instantaneous hazard, leads to an increase in H(t), the cumulative hazard, which is reflected in a decrease in S(t), the survival function.
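These relationships can be checked numerically for the simplest case of a constant hazard (the exponential distribution); a minimal sketch, assuming an illustrative hazard of 0.3:

```python
import math

h = 0.3          # constant hazard (exponential distribution); illustrative value
t = 2.0

S = math.exp(-h * t)          # survival function S(t)
F = 1 - S                     # cumulative distribution function F(t)
f = h * math.exp(-h * t)      # density f(t)
H = h * t                     # cumulative hazard H(t)

assert math.isclose(f / S, h)            # h(t) = f(t) / S(t)
assert math.isclose(-math.log(S), H)     # H(t) = -log S(t)
assert math.isclose(S, math.exp(-H))     # S(t) = exp(-H(t))
```

The same identities hold for any valid hazard function, not just the constant one used here.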

What assumptions must be made to use standard time-to-event data techniques?

The main assumption in analyzing TTE data is that of non-informative censoring: individuals who are censored have the same probability of experiencing a subsequent event as individuals who remain in the study. Informative censoring is analogous to non-ignorable missing data, which biases the analysis. There is no definitive way to test whether censoring is non-informative, although examining patterns of censoring can indicate whether the assumption is reasonable. If informative censoring is suspected, sensitivity analyses such as best-case and worst-case scenarios can be used to try to quantify its effect on the analysis.

Another assumption in analyzing TTE data is that there is sufficient follow-up time and a sufficient number of events for adequate statistical power. This needs to be considered in the study design phase, as most survival analyses are based on cohort studies.

Other simplifying assumptions are worth mentioning, as they are often made in overviews of survival analysis. Although these assumptions simplify survival models, they are not necessary for analyzing TTE data. Advanced techniques can be used when these assumptions are violated:

  • No cohort effect on survival: For a cohort with a long recruitment period, assume that people who join early have the same chances of survival as those who join late

  • Right censoring in the data only

  • Events are independent of each other

What types of approaches can be used for survival analysis?

There are three main approaches to analyzing TTE data: non-parametric, semi-parametric, and parametric approaches. The choice of approach to use should be based on the research question of interest. Often, multiple approaches can be reasonably used in the same analysis.

What are non-parametric approaches to survival analysis and when are they appropriate?

Non-parametric approaches do not rely on assumptions about the shape or form of parameters in the underlying population. In survival analysis, non-parametric approaches are used to describe the data by estimating the survival function, S(t), along with the median and quartiles of survival time. These descriptive statistics cannot be calculated directly from the data because of censoring, which underestimates the true survival time in censored subjects and leads to biased estimates of the mean, median, and other descriptives. Non-parametric approaches are often used as the first step in an analysis to generate unbiased descriptive statistics, and are frequently used in conjunction with semi-parametric or parametric approaches.

Kaplan-Meier estimator

The most common non-parametric approach in the literature is the Kaplan-Meier (or product-limit) estimator. The Kaplan-Meier estimator works by breaking the estimate of S(t) into a series of steps/intervals based on observed event times. Observations contribute to the estimate of S(t) until the event occurs or until they are censored. For each interval, the probability of surviving to the end of the interval is calculated, conditional on subjects being at risk at the beginning of the interval (commonly noted as pj = (nj - dj)/nj). The estimated S(t) for each value of t equals the product of surviving each interval up to and including time t. The main assumptions of this method, in addition to non-informative censoring, are that censoring occurs after failures and that there is no cohort effect on survival, so that subjects have the same survival probability regardless of when they entered the study.
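The product-limit calculation described above can be sketched in plain Python; this is an illustrative implementation for right-censored data, not a substitute for a survival library:

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t) at each distinct observed event time.

    times  : follow-up time per subject
    events : 1 if the event was observed, 0 if censored
    Returns a list of (t, S(t)) pairs at the event times.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for tt, e in data if tt == t and e == 1)  # events at t
        c = sum(1 for tt, e in data if tt == t and e == 0)  # censored at t
        if d > 0:
            s *= (n_at_risk - d) / n_at_risk   # p_j = (n_j - d_j) / n_j
            curve.append((t, s))
        n_at_risk -= d + c                     # shrink the risk set
        i += d + c
    return curve

# Illustrative data: events at t = 2, 3, 5; censoring at t = 3 and 8.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
# Steps at t = 2, 3, 5 with S = 0.8, 0.6, 0.3
```

Note how the subject censored at t = 3 still contributes to the risk sets at t = 2 and t = 3, which is exactly the partial information that a naive analysis would discard.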

The estimated S(t) from the Kaplan-Meier method can be plotted as a step function with time on the X-axis. This plot is a useful way to visualize the cohort's survival experience, and can also be used to estimate the median (when S(t) ≤ 0.5) or quartiles of survival time. These descriptive statistics can also be calculated directly using the Kaplan-Meier estimator. 95% confidence intervals (CI) for S(t) rely on transformations of S(t) to ensure that the 95% CI falls within 0 and 1. The most common method in the literature is the Greenwood estimator.

Life table estimator

The life table estimator of the survival function is one of the earliest examples of applied statistical methods, having been used for over 100 years to describe mortality in large populations. The life table estimator is similar to the Kaplan-Meier method, except that intervals are based on calendar time rather than observed events. Because life table methods are based on these calendar intervals rather than on individual event/censoring times, they use the average number at risk per interval to estimate S(t) and must assume that censoring occurred uniformly over the calendar interval. For this reason, the life table estimator is not as precise as the Kaplan-Meier estimator, although the results will be similar in very large samples.

Nelson-Aalen estimator

Another alternative to Kaplan-Meier is the Nelson-Aalen estimator, which is based on a counting-process approach to estimate the cumulative hazard function, H(t). The estimate of H(t) can then be used to estimate S(t). Estimates of S(t) derived with this method will always be greater than the Kaplan-Meier estimate, but the difference between the two methods is small in large samples.
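A corresponding sketch of the Nelson-Aalen estimator, assuming right-censored data given as paired time and event-indicator lists (illustrative code, not a library implementation):

```python
import math

def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard H(t), computed as the
    sum of d_j / n_j over event times; S(t) is then estimated as exp(-H(t)).

    Returns a list of (t, H(t), S(t)) triples at the event times.
    """
    data = sorted(zip(times, events))
    n = len(data)
    H = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for tt, e in data if tt == t and e == 1)  # events at t
        c = sum(1 for tt, e in data if tt == t and e == 0)  # censored at t
        if d > 0:
            H += d / n
            curve.append((t, H, math.exp(-H)))
        n -= d + c
        i += d + c
    return curve
```

For the same data, the exp(-H) survival estimates from this approach sit slightly above the Kaplan-Meier steps, consistent with the relationship noted in the text.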

Can nonparametric approaches be used for univariable or multivariable analyses?

Non-parametric approaches such as the Kaplan-Meier estimator can be used to conduct univariable analyses for categorical factors of interest. Factors must be categorical (either inherently or as a categorized continuous variable) because the survival function, S(t), is estimated for each level of the categorical variable and then compared across these groups. The estimated S(t) for each group can be plotted and compared visually.

Rank-based tests can also be used to statistically test the difference between survival curves. These tests compare the observed and expected number of events at each time point across groups, under the null hypothesis that the survival functions are equal in all groups. There are several versions of these rank-based tests, which differ in the weight given to each time point in the calculation of the test statistic. Two of the most common rank-based tests in the literature are the log-rank test, which gives each time point equal weight, and the Wilcoxon test, which weights each time point by the number of subjects at risk. Based on this weighting, the Wilcoxon test is more sensitive to differences between curves early in follow-up, when more subjects are at risk. Other tests, such as the Peto-Prentice test, use weights in between those of the log-rank and Wilcoxon tests. Rank-based tests are subject to the additional assumption that censoring is independent of group, and all are limited by low power to detect differences between groups when survival curves cross. Although these tests provide a p-value for the difference between curves, they cannot be used to estimate effect sizes (the p-value of the log-rank test, however, is equivalent to the p-value for a categorical factor of interest in a univariable Cox model).
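The log-rank test described above can be sketched as follows; an illustrative two-group implementation that weights every event time equally, using the hypergeometric mean and variance at each event time:

```python
def log_rank_statistic(g1, g2):
    """Log-rank chi-square statistic (1 degree of freedom) for two groups.

    g1, g2: lists of (time, event) pairs; event is 1 (observed) or 0 (censored).
    Values above 3.84 are significant at the 0.05 level.
    """
    event_times = sorted({t for t, e in g1 + g2 if e == 1})
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        n1 = sum(1 for tt, _ in g1 if tt >= t)   # at risk in group 1
        n2 = sum(1 for tt, _ in g2 if tt >= t)   # at risk in group 2
        d1 = sum(1 for tt, e in g1 if tt == t and e == 1)
        d2 = sum(1 for tt, e in g2 if tt == t and e == 1)
        n, d = n1 + n2, d1 + d2
        if n < 2:
            continue
        e1 = d * n1 / n                          # expected events in group 1
        obs_minus_exp += d1 - e1
        var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var

# Illustrative data: group 1 fails earlier than group 2.
stat = log_rank_statistic([(1, 1), (2, 1)], [(3, 1), (4, 1)])
```

Changing the per-time weights in the accumulation step is what distinguishes the Wilcoxon and Peto-Prentice variants from the equal-weight version shown here.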

Nonparametric models are limited in that they do not provide effect estimates and in general cannot be used to assess the effect of several factors of interest (multivariable models). For this reason, nonparametric approaches in epidemiology are often used in conjunction with semi- or fully parametric models, where multivariable models are typically used to control confounders.

Can Kaplan-Meier curves be adjusted?

It is a common myth that Kaplan-Meier curves cannot be adjusted, and this is often cited as a reason to use a parametric model that can generate covariate-adjusted survival curves. However, a method has been developed to create adjusted survival curves using inverse probability weighting (IPW). With only one covariate, the IPWs can be estimated non-parametrically and are equivalent to direct standardization of the survival curves to the study population. With multiple covariates, semi- or fully parametric models must be used to estimate the weights, which are then used to create multiple-covariate-adjusted survival curves. Advantages of this method are that it is not subject to the proportional hazards assumption, it can be used for time-varying covariates, and it can also be used for continuous covariates.

Why do we need parametric approaches for analyzing time-to-event data?

A non-parametric approach to analyzing TTE data is used simply to describe the survival data with respect to the factor under investigation. Models using this approach are also called univariable models. More commonly, investigators are interested in the relationship between several covariates and the time to event. Semi- and fully parametric models allow the time to event to be analyzed with respect to many factors simultaneously, and provide estimates of the strength of the effect of each constituent factor.

What is a semi-parametric approach and why is it used so often?

The Cox proportional hazards model is the most widely used multivariable approach for analyzing survival data in medical research. It is essentially a time-to-event regression model that describes the relationship between the event incidence, as expressed by the hazard function, and a set of covariates. The Cox model is written as follows:

Hazard function: h(t) = h0(t)exp{β1X1 + β2X2 + … + βpXp}

It is considered a semi-parametric approach because the model contains a non-parametric component and a parametric component. The non-parametric component is the baseline hazard, h0(t). This is the value of the hazard when all covariates equal 0, which highlights the importance of centering the covariates in the model for interpretability. Do not confuse the baseline hazard with the hazard at time 0. The baseline hazard function is estimated non-parametrically; therefore, unlike most other statistical models, the model does not assume that survival times follow a particular statistical distribution, and the shape of the baseline hazard is arbitrary. The baseline hazard function does not need to be estimated in order to make inferences about the relative hazard, or hazard ratio. This property makes the Cox model more robust than parametric approaches, because it is not vulnerable to misspecification of the baseline hazard.

The parametric component is the covariate vector. The covariate vector multiplies the baseline hazard by the same amount regardless of time, so the effect of any covariate is the same at any point in time during follow-up; this is the basis of the proportional hazards assumption.
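The fact that the baseline hazard cancels out can be seen in the partial likelihood the Cox model maximizes. A minimal sketch for a single covariate, assuming no tied event times (real software applies Breslow or Efron corrections for ties):

```python
import math

def cox_partial_loglik(beta, data):
    """Partial log-likelihood of a Cox model with a single covariate x.

    data: list of (time, event, x) triples; assumes no tied event times.
    Each event contributes beta*x_i minus the log of the sum of exp(beta*x)
    over the risk set at that time; h0(t) cancels out of every term, which
    is why it never needs to be estimated.
    """
    ll = 0.0
    for t_i, e_i, x_i in data:
        if e_i != 1:
            continue  # censored subjects contribute only through risk sets
        risk = [x for t, _, x in data if t >= t_i]
        ll += beta * x_i - math.log(sum(math.exp(beta * x) for x in risk))
    return ll

# Illustrative data: (time, event, binary covariate).
data = [(1, 1, 1), (2, 1, 0), (3, 0, 1), (4, 1, 0)]
```

Maximizing this function over beta (e.g., with a one-dimensional numeric optimizer) yields the log hazard ratio; at beta = 0 each event term reduces to minus the log of the risk-set size.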

What is the proportional hazard assumption?

The proportional hazard assumption is critical to the use and interpretation of a Cox model.

Under this assumption, there is a constant relationship between the outcome, or dependent variable, and the covariate vector. The implications of this assumption are that the hazard functions for any two individuals are proportional at any point in time and that the hazard ratio does not vary with time. In other words, if an individual has a risk of death at some initial time point that is twice as high as that of another individual, then the risk of death remains twice as high at all later time points. This assumption implies that the hazard curves for the groups should be proportional and should not cross. Because this assumption is so important, it must be tested.

How do you test the proportional hazard assumption?

There are a variety of techniques, both graphical and test-based, for assessing the validity of the proportional hazards assumption. One technique is simply to plot Kaplan-Meier survival curves when comparing two groups with no covariates. If the curves cross, the proportional hazards assumption may be violated. An important caveat applies to this approach in small studies: there can be large error in estimating survival curves for studies with small sample sizes, so the curves may cross even when the proportional hazards assumption is met. The complementary log-log plot is a more robust check that plots the log of the negative log of the estimated survival function against the log of survival time. If the hazards are proportional across groups, this plot yields parallel curves. Another common way to test the proportional hazards assumption is to include a time-covariate interaction term to determine whether the HR changes over time, since time is often the culprit behind non-proportionality of hazards. Evidence that the group-by-time interaction term is non-zero is evidence against proportional hazards.

What if the proportional hazard assumption does not apply?

If you find that the PH assumption does not hold, you do not necessarily need to abandon the Cox model. There are options for accommodating the non-proportionality in the model. For example, you can include other covariates in the model: new covariates, non-linear terms for existing covariates, or interactions between covariates. Or you can stratify the analysis on one or more variables. This estimates a model in which the baseline hazard is allowed to differ within each stratum, but the covariate effects are equal across strata. Other options include dividing time into categories and using indicator variables to allow hazard ratios to vary over time, and changing the analysis time variable (e.g., from elapsed time to age, or vice versa).

How do you assess semi-parametric model fit?

In addition to checking for violations of the proportionality assumption, other aspects of model fit must be examined. Statistics similar to those used in linear and logistic regression can be applied to perform these tasks for Cox models, with some differences, but the essential ideas are the same in all three settings. It is important to check the linearity of the covariate vector, which can be done by examining residuals, as in linear regression. However, residuals in TTE data are not as straightforward as in linear regression, partly because the value of the outcome is unknown for some of the data and because the residuals are often skewed. Several different types of residuals have been developed to assess Cox model fit for TTE data; examples include martingale and Schoenfeld residuals. You can also examine residuals to identify highly influential and poorly fit observations. There are also goodness-of-fit tests specific to Cox models, such as the Gronnesby and Borgan test and the Hosmer and Lemeshow prognostic index. You can also use the AIC to compare different models, although the use of R2 is problematic.

Why use a parametric approach?

One of the main advantages of semi-parametric models is that the baseline hazard does not need to be specified in order to estimate hazard ratios describing differences in relative hazard between groups. However, the estimate of the baseline hazard itself may be of interest; in that case a parametric approach is required. In parametric approaches, both the hazard function and the effect of the covariates are specified. The hazard function is estimated based on an assumed distribution in the underlying population.

Advantages of using a parametric approach to survival analysis are:

  • Parametric approaches are more informative than non- and semi-parametric approaches. In addition to calculating relative effect estimates, they can also be used to predict survival time, hazard rates, and mean and median survival time. They can also be used to make absolute risk predictions over time and to draw covariate-adjusted survival curves.

  • If the parametric form is specified correctly, parametric models are more informative than semi-parametric models. They are also more efficient, which results in smaller standard errors and more accurate estimates.

  • Parametric approaches rely on full maximum likelihood to estimate parameters.

  • Parametric model residuals take the well-known form of the difference between what is observed and what is expected.

The main disadvantage of a parametric approach is that it relies on the assumption that the underlying population distribution has been correctly specified. Parametric models are not robust to misspecification, which is why semi-parametric models are more common in the literature and are less risky to use when there is uncertainty about the underlying population distribution.

How to choose the parametric form

Choosing the appropriate parametric form is the most difficult part of parametric survival analysis. The specification of the parametric form should be driven by the study hypothesis, together with prior knowledge and the biological plausibility of the shape of the baseline hazard. For example, if it is known that the risk of death increases dramatically immediately after surgery and then declines and flattens, it would be inappropriate to specify the exponential distribution, which assumes a constant hazard over time. The data can be used to assess whether the specified form appears to fit the data, but such data-driven methods should complement, not replace, hypothesis-driven choices.

What is the difference between a proportional hazards model and an accelerated failure time model?


Although Cox's proportional hazards model is semi-parametric, proportional hazards models can also be parametric. Parametric proportional hazards models can be written as:

h(t,X) = h0(t)exp(Xβ) = h0(t)λ

where the baseline hazard function, h0(t), depends only on time, t, but not on X, and λ = exp(Xβ) is a subject-specific function of the covariates that does not depend on t and scales the baseline hazard function up or down. λ cannot be negative. In this model, the hazard rate is a multiplicative function of the baseline hazard, and the hazard ratios can be interpreted in the same way as in the semi-parametric proportional hazards model.

Accelerated failure time (AFT) models are a class of parametric survival models that can be linearized by taking the natural log of the survival time model. The simplest example of an AFT model is the exponential model, which is written as:

ln(T) = β0 + β1X1 + … + βpXp + ε

The main difference between AFT models and PH models is that AFT models assume that the effects of covariates are multiplicative on the time scale, while Cox models use the hazard scale shown above. Parameter estimates from AFT models are interpreted as effects on the time scale, which can either accelerate or decelerate survival time. Exp(β) > 1 from an AFT model means that the factor decelerates, or prolongs, the time to the event (longer survival), while exp(β) < 1 accelerates the time to the event (shorter survival). AFT models assume that estimated time ratios are constant across the time scale. A time ratio of 2, for example, can be interpreted as the median time to death in group 1 being double the median time to death in group 2 (indicating longer survival for group 1).

Some error distributions can be written and interpreted as both PH and AFT models (e.g., exponential, Weibull), others are PH models only (e.g., Gompertz) or AFT models only (e.g., log-logistic), and others are neither PH nor AFT models (e.g., models fit with splines).

What forms can parametric models take?

The hazard function can take any form as long as h(t) > 0 for all values of t. While the primary consideration in choosing the parametric form should be prior knowledge of the shape of the baseline hazard, each distribution has its own advantages and disadvantages. Some of the more common forms are briefly explained below, and further information is available in the resource list.


Exponential distribution

The exponential distribution assumes that h(t) depends only on the model coefficients and covariates and is constant over time. The main advantage of this model is that it is both a proportional hazards model and an accelerated failure time model, so effect estimates can be interpreted either as hazard ratios or as time ratios. The main disadvantage is that it is often implausible to assume a constant hazard over time.
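Because the exponential model assumes a constant hazard, its maximum likelihood estimate has a simple closed form, events divided by person-time; a hedged sketch with hypothetical data:

```python
def exponential_rate(times, events):
    """Closed-form MLE of the constant hazard under an exponential model:
    observed events divided by total person-time. Censored follow-up
    contributes person-time but no event, so censoring is handled naturally.
    """
    return sum(events) / sum(times)

# Illustrative data only: the exposed group has events sooner.
exposed_rate = exponential_rate([2, 4, 6, 3], [1, 1, 0, 1])       # 3 / 15
unexposed_rate = exponential_rate([8, 10, 7, 12], [1, 0, 1, 0])   # 2 / 37

hazard_ratio = exposed_rate / unexposed_rate
time_ratio = 1 / hazard_ratio   # exponential: PH and AFT interpretations agree
```

The last line reflects the property noted above: for the exponential distribution the same fit can be read as a hazard ratio or, inverted, as a time ratio.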

Weibull distribution

The Weibull distribution is similar to the exponential distribution. While the exponential distribution assumes a constant hazard, the Weibull distribution assumes a monotonic hazard, which can be either increasing or decreasing, but not both. It has two parameters. The shape parameter (σ) dictates whether the hazard increases (σ < 1) or decreases (σ > 1); in the exponential distribution this parameter is fixed at 1. The scale parameter, (1/σ)exp(-β0/σ), determines the scale of the increase or decrease. Since the Weibull distribution simplifies to the exponential distribution when σ = 1, the null hypothesis that σ = 1 can be tested with a Wald test. The main advantage of this model is that it is both a PH and an AFT model, so both hazard ratios and time ratios can be estimated. Again, the main disadvantage is that the assumption of a monotonic baseline hazard may be implausible in some cases.

Gompertz distribution

The Gompertz distribution is a PH model equal to the log-Weibull distribution, so the logarithm of the hazard function is linear in t. This distribution has an exponentially increasing failure rate and is often useful for actuarial data, since the risk of death also increases exponentially over time.

Log-logistic distribution

The log-logistic distribution is an AFT model whose error term follows the standard logistic distribution. It cannot fit monotonic hazard functions and generally fits best when the underlying hazard rises to a peak and then falls, which may be plausible for certain diseases such as tuberculosis. The log-logistic distribution is not a PH model, but it is a proportional odds model. This means it is subject to the proportional odds assumption, but the advantage is that the slope coefficients can be interpreted as time ratios and also as odds ratios. An odds ratio of 2 from a parametric log-logistic model, for example, would be interpreted as the odds of survival beyond time t being twice as high for subjects with x = 1 as for subjects with x = 0.

Generalized gamma (GG) distribution

The generalized gamma (GG) distribution is actually a family of distributions that contains almost all of the most commonly used distributions, including the exponential, Weibull, log-normal, and gamma distributions. This enables comparisons between the different distributions. The GG family also includes all four of the most common types of hazard functions, which makes the GG distribution particularly useful as the shape of the hazard function can help optimize model selection.

Splines approach

Since the only general restriction on the specification of the baseline hazard function is that h(t) > 0 for all values of t, splines can be used for maximum flexibility in modeling the shape of the baseline hazard. Restricted cubic splines are a method recently recommended in the literature for parametric survival analysis, since they allow flexibility in shape but restrict the function to be linear at the ends, where the data are sparse. Splines can be used to improve estimation and are also advantageous for extrapolation, since they maximize fit to the observed data. If correctly specified, effect estimates from models fit with splines should not be biased. As in other regression analyses, challenges in fitting splines can include choosing the number and location of knots and problems with overfitting.

How do you assess parametric model fit?

The most important component in assessing parametric model fit is verifying that the data support the specified parametric form. This can be assessed visually by graphing the model-based cumulative hazard against the Kaplan-Meier-estimated cumulative hazard function. If the specified form is correct, the graph should pass through the origin with a slope of 1. The Gronnesby-Borgan goodness-of-fit test can also be used to check whether the observed number of events differs significantly from the expected number of events in groups differentiated by risk score. This test is very sensitive to the number of groups chosen and tends to be too liberal in rejecting the null hypothesis of adequate fit when many groups are chosen, especially with small data sets. Conversely, the test lacks power to detect model violations when too few groups are chosen. For this reason, it seems unwise to rely on a goodness-of-fit test alone to determine whether the specified parametric form is reasonable.

AIC can also be used to compare models run with different parametric shapes, with the lowest AIC indicating the best fit. However, AIC cannot be used to compare parametric and semi-parametric models because parametric models are based on observed event times and semi-parametric models are based on the order of event times. Again, these tools should be used to check that the specified shape fits the data, but the plausibility of the specified underlying hazard is still the most important consideration when choosing a parametric shape.

Once the specified parametric form is found to fit the data well, methods similar to those described earlier for semi-parametric proportional hazards models, such as residual plots and goodness-of-fit tests, can be used to choose between different models.

What if predictors change over time?

In the models written above, we have assumed that exposures are constant over the course of follow-up. Exposures whose values change over time, or time-varying covariates, can be included in survival models by changing the unit of analysis from the individual to the period of time during which exposure is constant. This splits the person-time of individuals into intervals, each of which the person contributes to the exposed or unexposed risk set for that covariate. The main assumption in including a time-varying covariate in this way is that the effect of the time-varying covariate does not depend on time.
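Splitting person-time in this way can be sketched for the simple case of a single binary exposure that switches on once during follow-up (an illustrative helper; survival software provides counting-process data setups for the general case):

```python
def split_follow_up(total_time, event, switch_time):
    """Split one subject's follow-up into counting-process rows
    (start, stop, event, exposed) when exposure switches on at switch_time.

    A subject whose exposure never changes yields a single row; otherwise
    the unexposed interval carries no event and the exposed interval
    carries the subject's event indicator.
    """
    if switch_time is None or switch_time >= total_time:
        return [(0.0, total_time, event, 0)]
    return [
        (0.0, switch_time, 0, 0),              # unexposed person-time
        (switch_time, total_time, event, 1),   # exposed person-time, event here
    ]

# A subject followed for 10 units who becomes exposed at t = 4 and
# then has the event contributes one unexposed and one exposed row.
rows = split_follow_up(total_time=10.0, event=1, switch_time=4.0)
```

Each row then enters the risk sets only between its start and stop times, which is what allows the covariate value to differ across a subject's own follow-up.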

For a Cox proportional hazards model, a time-varying covariate enters as: h(t) = h0(t)e^(β1x1(t)). Time-varying covariates can also be included in parametric models, although this is somewhat more complicated and harder to interpret. Parametric models can also model time-varying covariates using splines for greater flexibility.
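The episode-splitting step described above can be sketched in a few lines of Python. This is an illustrative helper with made-up data, not library code: it turns one subject's follow-up into counting-process records of (start, stop, event indicator, covariate value), one row per period of constant exposure, which is the data layout used to fit h(t) = h0(t)e^(β1x1(t)).

```python
def split_episodes(follow_up, event, changes):
    """Split one subject's follow-up into intervals of constant exposure.
    changes: list of (time, value) covariate updates sorted by time;
    the first entry gives the baseline value at time 0."""
    rows = []
    for i, (start, value) in enumerate(changes):
        nxt = changes[i + 1][0] if i + 1 < len(changes) else follow_up
        stop = min(nxt, follow_up)
        if stop <= start:
            continue                                  # update after follow-up ended
        status = event if stop == follow_up else 0    # event only in the final interval
        rows.append((start, stop, status, value))
    return rows

# Subject followed for 10 time units, had the event, exposure switched on at t = 4:
print(split_episodes(10, 1, [(0, 0), (4, 1)]))
# -> [(0, 4, 0, 0), (4, 10, 1, 1)]
```

Each row then enters the model as its own observation, contributing person-time to the risk set with the covariate value it carries.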

In general, time-varying covariates should be used when it is hypothesized that the hazard depends more on later values of the covariate than on its value at the start of the study. Challenges with time-varying covariates include missing data on the covariate at different points in time and a potential bias in the hazard estimate if the time-varying covariate is actually a mediator.

What is Competing Risk Analysis?

Traditional methods of survival analysis assume that only one type of event of interest occurs. However, more advanced methods allow multiple types of events to be investigated in the same study, such as death from multiple causes. Competing risk analysis is used for these studies, in which survival is terminated by the first of several events. Special methods are required because analyzing the time to each event separately can be biased; in this context, the KM method tends to overestimate the proportion of subjects who experience events. Competing risk analysis instead uses the cumulative incidence method, in which the overall event probability at any point in time is the sum of the event-specific probabilities. The models are usually implemented by entering each study participant several times, once per event type. For each study participant, the time to any event is censored at the time the patient experienced the first event. For more information, please visit advancedepidemiology.org competing risks .
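The cumulative incidence calculation can be sketched directly. The Python helper below (illustrative, with made-up data) implements the nonparametric cumulative incidence estimator: at each event time, the type-specific increment is the overall survival just before that time multiplied by the type-specific event proportion among those at risk, so the event-specific probabilities sum correctly across competing causes.

```python
def cumulative_incidence(times, causes, cause):
    """Nonparametric cumulative incidence for one event type in the
    presence of competing events. causes: 0 = censored, 1, 2, ... = event types."""
    data = sorted(zip(times, causes))
    n = len(data)
    surv = 1.0          # overall (all-cause) Kaplan-Meier survival just before t
    cif = 0.0
    out = []
    i = 0
    while i < n:
        t = data[i][0]
        j = i
        d_all = d_k = 0
        while j < n and data[j][0] == t:
            if data[j][1] != 0:
                d_all += 1
                if data[j][1] == cause:
                    d_k += 1
            j += 1
        at_risk = n - i
        cif += surv * d_k / at_risk       # type-specific increment
        surv *= 1 - d_all / at_risk       # KM step for all causes combined
        out.append((t, cif))
        i = j
    return out

# Made-up data: cause 0 = censored, causes 1 and 2 compete
times  = [1, 2, 2, 3, 4, 5]
causes = [1, 2, 0, 1, 2, 1]
print(cumulative_incidence(times, causes, 1)[-1])   # cause-1 CIF at the last event time
```

Running the naive KM method separately per cause (censoring the competing cause) would overstate each curve; the cumulative incidence functions above, by contrast, never sum to more than 1 minus the overall survival.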

What are Frailty Models and why are they useful for correlated data?

Correlated survival data can arise from recurrent events experienced by one person or when observations are clustered in groups. Either because of a lack of knowledge or for reasons of feasibility, some covariates related to the event of interest may not be measurable. Frailty models account for the heterogeneity caused by unmeasured covariates by adding random effects that act multiplicatively on the hazard function. Frailty models are essentially extensions of the Cox model with the addition of random effects. Although various classification schemes and nomenclatures are used to describe these models, four common types of frailty models are shared, nested, joint, and additive frailty models.

Are there other approaches to analyzing recurrent event data?

Recurrent event data are correlated because multiple events can occur within the same subject. While frailty models are one way to account for this correlation in recurrent event analyses, a simpler approach that can also account for it is the use of robust standard errors (SE). With robust SEs added, recurrent event analysis can be performed as a simple extension of either semiparametric or parametric models.

Although robust SEs are easy to implement, there are several ways to model recurrent event data with them. These approaches differ in how they define the risk set for each recurrence and therefore answer slightly different study questions, so the choice of modeling approach should be based on the study hypothesis and the validity of the modeling assumptions.

The counting process, or Andersen-Gill, approach to modeling recurrent events assumes that each recurrence is an independent event and does not take the order or type of event into account. In this model, each subject's follow-up time begins at the start of the study and is divided into segments defined by events (recurrences). Subjects contribute to the risk set for an event as long as they are under observation (not censored) at that time. These models are easily fitted as a Cox model with a robust SE estimator, and hazard ratios are interpreted as the effect of the covariates on the recurrence rate over the follow-up period. However, this model is inappropriate if the independence assumption does not hold.
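The counting-process data layout described above can be sketched as follows; the helper is illustrative (hypothetical name, made-up data), showing how one subject's recurrences become (start, stop, event) segments for an Andersen-Gill fit.

```python
def andersen_gill_rows(event_times, follow_up):
    """Split a subject's follow-up at each recurrence into
    counting-process segments: (start, stop, event indicator)."""
    rows, start = [], 0
    for t in sorted(event_times):
        rows.append((start, t, 1))
        start = t
    if start < follow_up:
        rows.append((start, follow_up, 0))   # trailing censored segment
    return rows

# Subject with recurrences at t = 3 and t = 7, followed until t = 10:
print(andersen_gill_rows([3, 7], 10))
# -> [(0, 3, 1), (3, 7, 1), (7, 10, 0)]
```

Each segment is then treated as an observation in a Cox model with robust SEs clustered on the subject.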

Conditional approaches assume that a subject is not at risk for a subsequent event until a previous event has occurred, and therefore take the order of events into account. They are fitted using a stratified model, with the event number (in this case, the recurrence number) as the stratifying variable, and include robust SEs. There are two different conditional approaches that use different time scales and therefore have different risk sets. The conditional probability approach uses the time since the start of the study to define the time intervals and is appropriate when interest lies in the full course of the recurrent event process. The gap-time approach essentially resets the clock after each recurrence, using the time since the previous event to define the time intervals, and is more appropriate when event (or recurrence) specific effect estimates are of interest.
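The gap-time variant can be sketched as a data transformation (illustrative helper, made-up data): the clock resets after each event, and the recurrence number becomes the stratifying variable, so a subject only enters stratum k after experiencing event k-1.

```python
def gap_time_rows(event_times, follow_up):
    """Gap-time records for a conditional model: (start, gap, event, stratum).
    The clock resets after each event; the stratum is the recurrence number,
    so risk sets are conditional on having had the previous event."""
    rows, prev = [], 0
    for k, t in enumerate(sorted(event_times), start=1):
        rows.append((0, t - prev, 1, k))
        prev = t
    if prev < follow_up:
        rows.append((0, follow_up - prev, 0, len(rows) + 1))
    return rows

# Same subject as before: recurrences at t = 3 and t = 7, followed until t = 10:
print(gap_time_rows([3, 7], 10))
# -> [(0, 3, 1, 1), (0, 4, 1, 2), (0, 3, 0, 3)]
```

The conditional probability variant keeps the study-time clock instead; only the (start, stop) columns differ.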

Finally, marginal approaches (also known as the WLW approach, for Wei, Lin, and Weissfeld) treat each event as a separate process, so subjects are at risk for all events from the start of follow-up, regardless of whether they have had a previous event. This model is appropriate when the events are thought to arise from different underlying processes, so that, for example, a person could experience a 3rd event without experiencing the 1st. Although this assumption seems implausible for some types of data, such as cancer recurrence, it could be used to model injury recurrences over time, where subjects could experience different types of injuries that have no natural order. Marginal models can also be fitted as stratified models with robust SEs.
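The marginal data layout contrasts directly with the conditional one; in this illustrative sketch (hypothetical helper, made-up data), every subject contributes a row to every event-number stratum from time 0, censored at end of follow-up for events that never occurred.

```python
def wlw_rows(event_times, follow_up, max_events):
    """Marginal (WLW-style) records: (time, event indicator, stratum).
    Every subject is at risk for event k from time 0; if the k-th event
    is not observed, the row is censored at end of follow-up."""
    ts = sorted(event_times)
    rows = []
    for k in range(1, max_events + 1):
        if k <= len(ts):
            rows.append((ts[k - 1], 1, k))      # time to the k-th event
        else:
            rows.append((follow_up, 0, k))      # k-th event not observed
    return rows

# Subject with events at t = 3 and t = 7, followed until t = 10, 3 strata:
print(wlw_rows([3, 7], 10, 3))
# -> [(3, 1, 1), (7, 1, 2), (10, 0, 3)]
```

Unlike the gap-time layout, the subject is at risk in stratum 3 for the full follow-up even though only two events occurred.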


This project aimed to describe the methodological and analytical choices that one might encounter when working with time-to-event data, but it is by no means exhaustive. Resources are provided below to dig deeper into these topics.

Textbooks & Chapters

Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE (2012). Regression Methods in Biostatistics, 2nd Ed. New York, NY: Springer.

  • Introductory text on linear, logistic, survival, and repeated measures models, best for those looking for a basic starting point.

  • The Survival Analysis chapter offers a good overview but not much depth. Examples are Stata-based.

Hosmer DW, Lemeshow S, May S. (2008) Applied survival analysis: regression modeling of time-to-event data, 2nd ed. Hoboken, New Jersey: John Wiley & Sons, Inc.

  • In-depth overview of non-parametric, semi-parametric, and parametric Cox models, best for those familiar with other areas of statistics. Advanced techniques are not covered in detail, but references to other specialist textbooks are provided.

Kleinbaum DG, Klein M (2012). Survival Analysis: A Self-Learning Text, 3rd Edition New York, NY: Springer Science + Business Media, LLC

  • Excellent introductory text

Klein JP, Moeschberger ML (2005). Survival Analysis: Techniques for Censored and Truncated Data, 2nd Ed. New York, NY: Springer Science + Business Media, LLC

  • This book is designed for PhD students and offers many practical examples

Therneau TM, Grambsch PM (2000). Modeling Survival Data: Extending the Cox Model. New York, NY: Springer Science + Business Media, LLC

  • Good introduction to the counting process approach and to analyzing correlated survival data. The author also wrote the survival package in R.

Allison PD (2010). Survival Analysis with SAS: A Practice Guide, 2nd Ed. Cary, NC: SAS Institute

  • Great application text for SAS users

Bagdonavicius V., Nikulin M (2002). Accelerated life models: modeling and statistical analysis. Boca Raton, FL: Chapman & Hall / CRC Press.

  • Good source for more information on parametric and semi-parametric accelerated failure time models and how they compare to proportional hazards models

Methodical articles

Introductory / review article

Hougaard P (1999). Fundamentals of survival data. Biometrics 55 (1): 13-22. PMID: 11318147 .

Clark TG, Bradburn MJ, Love SB, Altman DG (2003). Survival Analysis Part I: Basic Concepts and First Analyses. Br J Cancer 89 (2): 232-8. PMID: 12865907

Clark TG, Bradburn MJ, Love SB, Altman DG (2003). Survival Analysis Part II: Multivariate Data Analysis - An Introduction to Concepts and Methods. Br J Cancer 89 (3): 431-6. PMID: 1288808

Clark TG, Bradburn MJ, Love SB, Altman DG (2003). Survival Analysis Part III: Multivariate Data Analysis - Choosing a Model and Assessing Its Adequacy and Fit. Br J Cancer 89 (4): 605-11. PMID: 12951864

Clark TG, Bradburn MJ, Love SB, Altman DG (2003). Survival Analysis Part IV: Further Concepts and Methods in Survival Analysis. Br J Cancer 89 (5): 781-6. PMID: 12942105

  • The above series of four articles is an excellent introductory overview of survival analysis methods that is very well written and easy to understand - it is highly recommended.

Age as a time scale

Korn EL, Graubard BI, Midthune D (1997). Time-to-event analysis of longitudinal follow-up of a survey: choice of the time-scale. Am J Epidemiol 145 (1): 72-80. PMID: 8982025

  • Paper advocating the use of age as a timescale rather than study time.

Ingram DD, Makuc DM, Feldman JJ (1997). Re: Time-to-event analysis of longitudinal follow-up of a survey: choice of the time-scale. Am J Epidemiol 146 (6): 528-9. PMID: 9290515 .

  • Comment on the Korn paper describing the precautions to be taken when using age as a timescale.

Thiébaut AC, Bénichou J (2004). Choice of timescale in Cox's model analysis of epidemiological cohort data: a simulation study. Stat Med 23 (24): 3803-20. PMID: 15580597

  • Simulation study showing the extent of bias for various degrees of association between age and the covariate of interest when using study time as a timescale.

Canchola AJ, Stewart SL, Bernstein L, et al. Cox regression with different time scales. Available at: http://www.lexjansen.com/wuss/2003/DataAnalysis/i-cox_time_scales.pdf .

  • A nice paper that compares 5 Cox regression models with variations in study time or age as a timescale with SAS code.


Huang CY, Ning J, Qin J (2015). Semiparametric likelihood inference for left-truncated and right-censored data. Biostatistics [epub] PMID: 25796430 .

  • This paper provides a nice introduction to the analysis of censored data and offers a new estimation method for the survival time distribution with left and right censored data. It's very dense and has an advanced statistical focus.

Cain KC, Harlow SD, Little RJ, Nan B, Yosef M, Taffe JR, Elliott MR (2011). Bias due to left truncation and left censoring in longitudinal studies of developmental and disease processes. Am J Epidemiol 173 (9): 1078-84. PMID: 21422059 .

  • An excellent resource that explains the biases inherent in left-truncated and left-censored data from an epidemiological point of view.

Sun J, Sun L, Zhu C (2007). Testing the proportional odds model for interval-censored data. Lifetime Data Anal 13: 37–50. PMID 17160547 .

  • Another statistically dense article on a nuanced aspect of TTE data analysis, but which provides a good explanation for interval censored data.

Robins JM (1995a). An analytic method for randomized trials with informative censoring: Part I. Lifetime Data Anal 1: 241-254. PMID 9385104 .

Robins JM (1995b). An analytic method for randomized trials with informative censoring: Part II. Lifetime Data Anal 1: 417-434. PMID 9385113 .

  • Two articles discussing methods of dealing with informative censoring.

Non-parametric survival methods

Borgan Ø (2005). Kaplan-Meier estimator. Encyclopedia of Biostatistics. DOI: 10.1002/0470011815.b2a11042

  • Excellent overview of the Kaplan-Meier estimator and its relationship to the Nelson-Aalen estimator

Rodríguez G (2005). Nonparametric Estimation in Survival Models. Available from: http://data.princeton.edu/pop509/NonParametricSurvival.pdf

  • Introduction to nonparametric methods and the Cox proportional hazard model, which explains the relationships between methods with the mathematical formulas

Cole SR, Hernan MA (2004). Adjusted survival curves with inverse probability weights. Comput Methods Programs Biomed 75 (1): 35-9. PMID: 15158046

  • Describes how to use IPW to create adjusted Kaplan-Meier curves. Includes an example and a SAS macro.

Zhang M. (2015). Robust methods for improving the efficiency and reducing bias in estimating survival curves in randomized clinical trials. Lifetime data Anal 21 (1): 119-37. PMID: 24522498

  • Proposed method for covariate-adjusted survival curves in RCTs

Semiparametric survival methods

Cox DR (1972). Regression models and life-tables (with discussion). J R Statist Soc B 34: 187-220.

  • The classic reference.

Christensen E (1987) Multivariate survival analysis using the Cox regression model. Hepatology 7: 1346-1358. PMID 3679094 .

  • Describes the application of the Cox model using a motivating example. Excellent overview of the most important aspects of Cox model analysis, including fitting a Cox model and reviewing model assumptions.

Grambsch PM, Therneau TM (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81: 515-526.

  • A detailed paper on testing the proportional hazard assumption. Good mix of theory and advanced statistical explanation.

Ng’andu NH (1997) An empirical comparison of statistical tests to evaluate the proportional hazard assumption of the Cox model. Statistics Med 16: 611-626. PMID 9131751 .

  • Another extensive paper on testing the proportional hazards assumption, including a discussion of the examination of residuals and the effects of censoring.

Parametric survival methods

Rodríguez G (2010). Parametric survival models. Available from: http://data.princeton.edu/pop509/ParametricSurvival.pdf

  • Brief introduction to the most common distributions used in parametric survival analysis

Nardi A, Schemper M (2003). Comparison of Cox and parametric models in clinical trials. Stat Med 22 (23): 2597-610. PMID: 14652863

  • Provides good examples of comparing semi-parametric models with models that use common parametric distributions, and focuses on evaluating the model fit

Royston P., Parmar M.K. (2002). Flexible parametric proportional hazards and proportional odds models for censored survival data with application to prognostic modeling and estimation of treatment effects. Statistics Med 21 (15): 2175-97. PMID: 12210632

  • Good explanation for the basics of proportional hazards and odds models and comparisons with cubic splines

Cox C, Chu H, Schneider MF, Muñoz A (2007). Parametric survival analysis and taxonomy of hazard functions for the generalized gamma distribution. Statistics Med 26: 4352-4374. PMID 17342754 .

  • Provides an excellent overview of parametric survival methods, including a taxonomy of hazard functions and an in-depth discussion of the generalized gamma distribution family.

Crowther MJ, Lambert PC (2014). A general framework for parametric survival analysis. Stat Med 33 (30): 5280-97. PMID: 25220693

  • Describes restrictive assumptions of commonly used parametric distributions and explains the restricted cubic spline method

Sparling YH, Younes N, Lachin JM, Bautista OM (2006). Parametric survival models for interval-censored data with time-dependent covariates. Biostatistics 7 (4): 599-614. PMID: 16597670

  • Extension and example for the use of parametric models with interval censored data

Time-varying covariates

Fisher LD, Lin DY (1999). Time-dependent covariates in the Cox proportional hazards regression model. Annu Rev. Public Health 20: 145-57. PMID: 10352854

  • Thorough and easy to understand explanation of time-varying covariates in Cox models with a mathematical appendix

Petersen T (1986). Fitting parametric survival models with time-dependent covariates. Appl Statist 35 (3): 281-88.

  • Dense article, but with a useful applied example

Competing risk analysis


See Competing Risks

Tai B, Machin D, White I, Gebski V (2001). Competing risks analysis of patients with osteosarcoma: a comparison of four different approaches. Stat Med 20: 661-684. PMID 11241570 .

  • Good in-depth article that describes four different methods of analyzing competing risk data and uses data from a randomized study of patients with osteosarcoma to compare these four approaches.

Checkley W, Brower RG, Muñoz A (2010). Inference for mutually exclusive competing events through a mixture of generalized gamma distributions. Epidemiology 21 (4): 557-565. PMID 20502337 .

  • Competing risks paper using the generalized gamma distribution.

Analysis of clustered data and frailty models

Yamaguchi T, Ohashi Y, Matsuyama Y (2002) Proportional hazard models with random effects for the investigation of center effects in multicenter clinical cancer studies. Statistical Methods Med Res 11: 221-236. PMID 12094756 .

  • A paper with excellent theoretical and mathematical explanation for considering clustering when analyzing survival data from multicenter clinical studies.

O’Quigley J, Stare J (2002) Proportional hazard models with frailty and random effects. Statistics Med 21: 3219-3233. PMID 12375300 .

  • A head-to-head comparison of frailty models and random effects models.

Balakrishnan N, Peng Y (2006). Generalized Gamma Frailty Model. Statistics Med 25: 2797–2816. PMID

  • An article on frailty models that use the generalized gamma distribution as the frailty distribution.

Rondeau V, Mazroui Y, Gonzalez JR (2012). frailtypack: An R package for the analysis of correlated survival data with frailty models using penalized likelihood estimation or parametrical estimation. Journal of Statistical Software 47 (4): 1-28.

  • R package vignette with good background information on frailty models.

Schaubel DE, Cai J (2005). Analysis of clustered recurrent event data with application to hospitalization rates among renal failure patients. Biostatistics 6 (3): 404-19. PMID 15831581 .

  • Excellent paper in which the authors present two methods of analyzing clustered recurrent event data and then compare the results of the proposed models with those based on a frailty model.

Gharibvand L, Liu L (2009). Analysis of survival data with clustered events. SAS Global Forum 2009 paper 237-2009.

  • Concise and easy-to-understand source for analyzing time-to-event data with clustered events using SAS methods.

Analysis of recurrent events

Twisk JW, Smidt N, de Vente W (2005). Applied analysis of recurrent events: a practical overview. J Epidemiol Community Health 59 (8): 706-10. PMID: 16020650

  • Very easy to understand introduction to the modeling of recurrent events and the concept of risk sets

Villegas R, Juliá O, Ocaña J (2013). Empirical study of correlated survival times for recurrent events with proportional hazards margins and the effect of correlation and censoring. BMC Med Res Methodol 13:95. PMID: 23883000

  • Uses simulations to test the robustness of various models for recurrent event data

Kelly PJ, Lim LL (2000). Survival analysis for recurrent event data: an application to childhood infectious diseases. Stat Med 19 (1): 13-33. PMID: 10623190

  • Applied examples of the four main approaches to modeling recurrent event data

Wei LJ, Lin DY, Weissfeld L (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association 84 (408): 1065-1073

  • The original paper describing marginal models for the analysis of recurrent event data


Summer Institute for Epidemiology and Population Health at Columbia University (EPIC)

Statistical Horizons, private provider of specialized statistical seminars taught by experts in the field

Inter-University Consortium for Political and Social Research (ICPSR) Summer Program in Quantitative Methods of Social Research, part of the Institute for Social Research at the University of Michigan

  • 3-day seminar on survival analysis, event history modeling, and duration analysis offered June 22-24, 2015 in Berkeley, California, taught by Tenko Raykov of Michigan State University. Comprehensive overview of survival methods across disciplines (not just public health): http://www.icpsr.umich.edu/icpsrweb/sumprog/courses/0200

The Institute for Statistical Research offers two online courses on survival analysis several times a year. These courses are based on the survival analysis textbooks by Klein and Kleinbaum (see above) and can be taken à la carte or as part of a certificate program in statistics:

The Institute for Digital Research and Education at UCLA offers seminars on survival analysis in various statistical programs via its website. These seminars demonstrate how to perform applied survival analysis, with an emphasis more on code than theory.

Interesting Articles