This lightweight package provides tools for conducting clone-censor weighting (CCW) to address immortal time bias in survival analysis. This vignette walks through the applied tutorial published by Maringe et al. (2020). Refer to Gaber et al. (2024) for more details on CCW in practice and to Hernán and Robins (2016) for more technical details.
CCW is useful in the presence of immortal person-time bias in observational studies. For instance, when comparing surgery recipients vs non-recipients in non-small cell lung cancer (NSCLC), a naive comparison favors surgery because patients must survive long enough to receive it: the person-time before surgery is effectively "immortal", and patients who die before they could receive surgery can only ever be counted as non-recipients. This is a form of immortal time bias.
The CCW toy dataset published by Maringe et al. uses this exact setting as the motivating example. Let's explore the dataset, which comes with `survivalCCW`.
```r
library(survivalCCW)
library(DT) # For better printing of data.frames

data(dummy_data)

head(dummy_data) |>
  DT::datatable(
    rownames = FALSE,
    options = list(
      scrollX = TRUE,
      paging = FALSE
    )
  )
```
Column descriptions can be found with `?dummy_data`.
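If you prefer to inspect the columns directly from the console, base R's `str()` (shown here purely as a convenience; it is not required for the workflow) lists the column names and types:

```r
# Inspect the column names and types of the toy dataset
str(dummy_data)
```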
Note that this package addresses situations in which the covariates are all defined at baseline.
The first step is to create the clones. This can be done for any time-to-event outcome using the `survivalCCW` function `create_clones()`. For `create_clones()` to work, we need to pass a one-row-per-patient `data.frame` with the following columns:

- a patient identifier (`id`)
- an event indicator (`event`). Note that additional values are not yet permitted.
- the time to event (`timetoevent`)
- an exposure indicator (`exposure`). Must be (0) or (1).
- the time to exposure (`timetoexposure`)

All other columns will be propagated for each patient. Let's see what this looks like in practice.
```r
# Create clones
clones <- create_clones(
  df = dummy_data,
  id = 'id',
  event = 'event',
  time_to_event = 'timetoevent',
  exposure = 'exposure',
  time_to_exposure = 'timetoexposure',
  ced_window = 100
)
#> Updating 4 patients' exposure and time-to-exposure based on CED window
```
```r
head(clones) |>
  DT::datatable(
    rownames = FALSE,
    options = list(
      scrollX = TRUE,
      paging = FALSE
    )
  )
```
Note that this object is just a `data.frame` with an additional custom class which future functions will evaluate:

```r
class(clones)
#> [1] "ccw_clones" "data.frame"
```
You can visualize the censoring over time after you create the clones:

```r
plot_censoring_over_time(clones)
```
Now we simply need to cast the data to long format. The `survivalCCW` function `cast_clones_to_long()` will do this for us. No additional arguments are needed (the `clones` object is an artifact that allows you to better see and understand the method):

```r
clones_long <- cast_clones_to_long(clones)

head(clones_long) |>
  DT::datatable(
    rownames = FALSE,
    options = list(
      scrollX = TRUE,
      paging = FALSE
    )
  )
```
Let's pick out a single patient and look at their data.
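For instance, we can subset the long data for the same patient that is examined again below; the display here is a sketch that mirrors the `DT::datatable()` calls used elsewhere in this vignette:

```r
# All rows (both clones, all time intervals) for a single patient
clones_long[clones_long$id == "pat3462", ] |>
  DT::datatable(
    rownames = FALSE,
    options = list(
      scrollX = TRUE,
      paging = TRUE
    )
  )
```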
Now we simply need to generate the weights. The `survivalCCW` function `generate_ccw()` will do this for us.

```r
clones_long_weights <- generate_ccw(clones_long, predvars = c("cov1", "cov2"))

head(clones_long_weights) |>
  DT::datatable(
    rownames = FALSE,
    options = list(
      scrollX = TRUE,
      paging = FALSE
    )
  )
```
Let’s pick out a single patient and look at their data:
```r
clones_long_weights[clones_long_weights$id == "pat3462", ] |>
  DT::datatable(
    rownames = FALSE,
    options = list(
      scrollX = TRUE,
      paging = TRUE
    )
  )
```
You can also visualize weights over time with `plot_ccw_over_time()` and mean values over time with `plot_var_mean_over_time()`.

```r
clones_long_weights |>
  plot_ccw_over_time()

clones_long_weights |>
  plot_var_mean_over_time("cov1")
```
We now have everything we need to conduct a CCW analysis. For instance, we can pipe things together to evaluate the hazard ratio for surgery vs no surgery:
```r
library(survival)

df <- dummy_data |>
  create_clones(
    id = 'id',
    event = 'event',
    time_to_event = 'timetoevent',
    exposure = 'exposure',
    time_to_exposure = 'timetoexposure',
    ced_window = 100
  ) |>
  cast_clones_to_long() |>
  generate_ccw(c('cov1', 'cov2'))
#> Updating 4 patients' exposure and time-to-exposure based on CED window

coxph(Surv(t_start, t_stop, outcome) ~ clone, data = df, weights = weight_cox)
#> Call:
#> coxph(formula = Surv(t_start, t_stop, outcome) ~ clone, data = df,
#>     weights = weight_cox)
#> 
#>         coef exp(coef) se(coef) robust se     z     p
#> clone 0.2125    1.2368   0.1390    0.1900 1.119 0.263
#> 
#> Likelihood ratio test=2.33  on 1 df, p=0.1272
#> n= 32983, number of events= 119
```
Note that we used `outcome` and not `event` in the `coxph()` model. Still, there is of course a problem with this analysis, as the cloning process renders the variance invalid. The simplest approach to addressing this is to bootstrap the variance. I have not made a function to do this yet, but leave the below as an example of how to do this.
```r
library(boot)
#> 
#> Attaching package: 'boot'
#> The following object is masked from 'package:survival':
#> 
#>     aml
```
```r
boot_cox <- function(data, indices) {

  # Make long data.frame with weights
  ccw_df <- data[indices, ] |>
    create_clones(
      id = 'id',
      event = 'event',
      time_to_event = 'timetoevent',
      exposure = 'exposure',
      time_to_exposure = 'timetoexposure',
      ced_window = 100
    ) |>
    cast_clones_to_long() |>
    generate_ccw(c('cov1', 'cov2'))

  # Extract HR from CoxPH
  cox_ccw <- coxph(Surv(t_start, t_stop, outcome) ~ clone, data = ccw_df, weights = weight_cox)

  hr <- cox_ccw |>
    coef() |>
    exp()

  out <- c("hr" = hr)

  # Create survfit objects for each of treated and untreated
  surv_1 <- survfit(Surv(t_start, t_stop, outcome) ~ 1L, data = ccw_df[ccw_df$clone == 1, ], weights = weight_cox)
  surv_0 <- survfit(Surv(t_start, t_stop, outcome) ~ 1L, data = ccw_df[ccw_df$clone == 0, ], weights = weight_cox)

  # RMST difference
  rmst_1 <- surv_1 |>
    summary(rmean = 365) |>
    (\(summary) summary$table)() |>
    (\(table) table["rmean"])()

  rmst_0 <- surv_0 |>
    summary(rmean = 365) |>
    (\(summary) summary$table)() |>
    (\(table) table["rmean"])()

  rmst_diff <- rmst_1 - rmst_0

  out <- c(out, "rmst_diff" = rmst_diff)

  # 1-year survival difference
  # Find the index of the time point closest to 1 year
  index_1yr_1 <- which.min(abs(surv_1$time - 365))
  index_1yr_0 <- which.min(abs(surv_0$time - 365))

  # Get the 1-year survival probabilities
  surv_1_1yr <- surv_1$surv[index_1yr_1]
  surv_0_1yr <- surv_0$surv[index_1yr_0]

  surv_diff_1yr <- surv_1_1yr - surv_0_1yr

  out <- c(out, "surv_diff_1yr" = surv_diff_1yr)

  # Return the three statistics (boot() collects these for each replicate)
  out
}
```

```r
boot_out <- boot(data = dummy_data, statistic = boot_cox, R = 10)
#> Updating 4 patients' exposure and time-to-exposure based on CED window
#> Updating 3 patients' exposure and time-to-exposure based on CED window
#> Updating 7 patients' exposure and time-to-exposure based on CED window
#> Updating 4 patients' exposure and time-to-exposure based on CED window
#> Updating 1 patients' exposure and time-to-exposure based on CED window
#> Updating 5 patients' exposure and time-to-exposure based on CED window
#> Updating 5 patients' exposure and time-to-exposure based on CED window
#> Updating 8 patients' exposure and time-to-exposure based on CED window
#> Updating 5 patients' exposure and time-to-exposure based on CED window
#> Updating 2 patients' exposure and time-to-exposure based on CED window
#> Updating 2 patients' exposure and time-to-exposure based on CED window
```
```r
boot.ci(boot_out, type = "norm", index = 1)
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 10 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = boot_out, type = "norm", index = 1)
#>
#> Intervals :
#> Level Normal
#> 95% ( 1.013, 1.609 )
#> Calculations and Intervals on Original Scale
boot.ci(boot_out, type = "norm", index = 2)
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 10 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = boot_out, type = "norm", index = 2)
#>
#> Intervals :
#> Level Normal
#> 95% (-52.68, 4.70 )
#> Calculations and Intervals on Original Scale
boot.ci(boot_out, type = "norm", index = 3)
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 10 bootstrap replicates
#>
#> CALL :
#> boot.ci(boot.out = boot_out, type = "norm", index = 3)
#>
#> Intervals :
#> Level Normal
#> 95% (-0.1783, 0.1729 )
#> Calculations and Intervals on Original Scale
```
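The bootstrap above already builds weighted `survfit` objects per arm to compute the RMST and 1-year survival contrasts. Outside the bootstrap, the same idea can be used to inspect the weighted survival curves by clone arm. A minimal sketch, assuming the `df` object created above is still available (point estimates only, since the naive variance has the same problem as the Cox model):

```r
# Weighted survival curves for each clone arm (point estimates only)
surv_by_clone <- survfit(
  Surv(t_start, t_stop, outcome) ~ clone,
  data    = df,
  weights = weight_cox
)

# Weighted survival at one year in each arm
summary(surv_by_clone, times = 365)
```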
Extending the package's functionality to competing risks is straightforward: we keep our outcome variable as an integer but allow additional values.
```r
dummy_data$competing <- sample(0:2, nrow(dummy_data), replace = TRUE)

head(dummy_data) |>
  DT::datatable(
    rownames = FALSE,
    options = list(
      scrollX = TRUE,
      paging = FALSE
    )
  )
```
We can then conduct the same analysis as before, but with the `competing` variable as the outcome. The bootstrapped analysis is not shown, but the same approach can be used.
```r
compete_ccw <- dummy_data |>
  create_clones(
    id = 'id',
    event = 'competing',
    time_to_event = 'timetoevent',
    exposure = 'exposure',
    time_to_exposure = 'timetoexposure',
    ced_window = 100
  ) |>
  cast_clones_to_long() |>
  generate_ccw(c('cov1', 'cov2'))
#> Updating 4 patients' exposure and time-to-exposure based on CED window

table(compete_ccw$outcome)
#> 
#>     0     1     2 
#> 32828    75    80

head(compete_ccw) |>
  DT::datatable(
    rownames = FALSE,
    options = list(
      scrollX = TRUE,
      paging = FALSE
    )
  )
```
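With more than two outcome levels, one common way to proceed is a cause-specific analysis: treat the event type of interest as the event and everything else as censoring. This is only a sketch of one possible approach, assuming event type 1 is the cause of interest:

```r
# Weighted cause-specific Cox model for event type 1
# (event type 2 is treated as censoring here)
coxph(
  Surv(t_start, t_stop, outcome == 1) ~ clone,
  data    = compete_ccw,
  weights = weight_cox
)
```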
You can trim/winsorize extreme weights using the `winsorize_ccw_weights()` function:
```r
compete_ccw_trim <- compete_ccw |>
  winsorize_ccw_weights(quantiles = c(0.10, 0.90))

max(compete_ccw$weight_cox)
#> [1] 2.931628

max(compete_ccw_trim$weight_cox)
#> [1] 2.610912
```
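The winsorized object keeps the same `weight_cox` column (now trimmed), so it can be dropped into the same weighted model calls shown above; for example, as a sketch:

```r
# Refit the weighted cause-specific Cox model using the winsorized weights
coxph(
  Surv(t_start, t_stop, outcome == 1) ~ clone,
  data    = compete_ccw_trim,
  weights = weight_cox
)
```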
Hernán, Miguel A., and James M. Robins. "Using big data to emulate a target trial when a randomized trial is not available." American Journal of Epidemiology 183.8 (2016): 758-764.

Gaber, Charles E., et al. "The Clone-Censor-Weight Method in Pharmacoepidemiologic Research: Foundations and Methodological Implementation." Current Epidemiology Reports (2024): 1-11.

Maringe, Camille, et al. "Reflection on modern methods: trial emulation in the presence of immortal-time bias. Assessing the benefit of major surgery for elderly lung cancer patients using observational data." International Journal of Epidemiology 49.5 (2020): 1719-1729.