Model Fitting

Here we describe the basic steps of joint model fitting with {jmpost}.

Model Specification

Each JointModel needs to be specified with three parts:

longitudinal: The model for the longitudinal outcomes.
survival: The model for the survival outcomes.
link: The link that specifies how the longitudinal model parameters enter the survival model.

Default Options

Let’s first specify a very simple joint model with:

A random slope model for the longitudinal outcome.
A Weibull proportional hazards model for the survival outcome.
The link between the two models is that the random slope from the longitudinal model enters as a product with a link coefficient into the linear predictor of the survival model.

simple_model <- JointModel(
    longitudinal = LongitudinalRandomSlope(),
    survival = SurvivalWeibullPH(),
    link = linkDSLD()
)

Note that here we use all the default options for the two models and the link, in particular the prior distributions and the initial values in the MCMC chain for the parameters are automatically chosen. We can see this from the arguments of the constructors (or from the help page):

args(LongitudinalRandomSlope)
#> function (intercept = prior_normal(30, 10), slope_mu = prior_normal(1, 
#>     3), slope_sigma = prior_lognormal(0, 1.5), sigma = prior_lognormal(0, 
#>     1.5)) 
#> NULL

So here we see that the Longitudinal Random Slope model has 4 parameters that we can define a prior for.

Specifying Priors

We can alternatively also specify the prior distributions for the parameters manually. This is important in practice to obtain a meaningful model specification and hence converging MCMC chains that allow to estimate the posterior distributions.

For the random slope model for the longitudinal outcome, we can e.g. say:

random_slope_model <- LongitudinalRandomSlope(
    intercept = prior_normal(40, 5),
    slope_mu = prior_normal(10, 2)
)

This sets the prior for the intercept to be a \(N(40, 5)\) distribution and the prior for the slope_mu parameter to be a \(N(10, 2)\) distribution.

Separate Models

It is also possible to not link the longitudinal and the survival models, by using the special linkNone() link specification. For example,

simple_model_no_link <- JointModel(
    longitudinal = LongitudinalRandomSlope(),
    survival = SurvivalWeibullPH(),
    link = linkNone()
)

would allow to fit the two models separately, but in the same MCMC chain.

Single Models

It is also possible to specify only the longitudinal or the survival model. Then these can be estimated on their own with separate MCMC chains.

single_longitudinal <- JointModel(
    longitudinal = LongitudinalRandomSlope()
)
single_survival <- JointModel(
    survival = SurvivalWeibullPH()
)

Data Preparation

Before we can fit the models, we need to prepare the data in the right format.

Simulating Data

Here we start from a simulated data set.

We assign 50 subjects each to the two treatment arms.
We use a time grid from 1 to 2000, e.g. specifying the days after randomization.
We use an exponentially distributed censoring time with mean of 9000 days.
We use a categorical covariate with three levels A, B and C in the overall survival model, drawn uniformly from the three levels. (Note that this is hardcoded at the moment, so the levels need to be A, B, C.)
We use another continuous covariate in the overall survival model generated from a standard normal distribution, with coefficient 0.3.
For the longitudinal outcome, we draw the values from a random slope model with the given parameters.
For the survival outcome, we draw the true value from a Weibull model. Note that it is fairly easy to put here another choice, you just need to specify a function of time returning the log baseline hazard under the given survival model.

So let’s run the code for that:

set.seed(129)
sim_data <- SimJointData(
    design = list(
        SimGroup(50, "Arm-A", "Study-X"),
        SimGroup(50, "Arm-B", "Study-X")
    ),
    longitudinal = SimLongitudinalRandomSlope(
        times = c(1, 50, 100, 150, 200, 250, 300),
        intercept = 30,
        slope_mu = c(1, 2),
        slope_sigma = 0.2,
        sigma = 20,
        link_dsld = 0.1
    ),
    survival = SimSurvivalWeibullPH(
        lambda = 1 / 300,
        gamma = 0.97,
        time_max = 2000,
        time_step = 1,
        lambda_cen = 1 / 9000,
        beta_cat = c(
            "A" = 0,
            "B" = -0.1,
            "C" = 0.5
        ),
        beta_cont = 0.3
    )
)

We might get a message here that a few subjects did not die before the day 2000, but this is not of concern. Basically it just gives us a feeling of how many survival events are included in the data set.

Formatting Data

Next we bring the data into the right format.

We start with extracting data into individual data sets, and then reducing the longitudinal data to specific time points.

os_data <- sim_data@survival
long_data <- sim_data@longitudinal

Let’s have a quick look at the format:

The survival data has:

subject ID
time point
continuous covariate value
categorical covariate level
event indicator (1 for observed, 0 for censored)
study ID
treatment arm

head(os_data)
#> # A tibble: 6 × 7
#>   subject     study   arm    time event cov_cont cov_cat
#>   <chr>       <fct>   <fct> <dbl> <dbl>    <dbl> <fct>  
#> 1 subject_001 Study-X Arm-A    35     1   -1.12  B      
#> 2 subject_002 Study-X Arm-A    17     1   -0.990 C      
#> 3 subject_003 Study-X Arm-A   876     1   -1.37  C      
#> 4 subject_004 Study-X Arm-A   100     1   -1.36  C      
#> 5 subject_005 Study-X Arm-A     9     1    2.00  B      
#> 6 subject_006 Study-X Arm-A   122     1    0.696 B

The longitudinal data has:

subject ID
time point
sum of longest diameters (SLD)
study ID
treatment arm
observation flag

head(long_data)
#> # A tibble: 6 × 6
#>   subject     arm   study    time   sld observed
#>   <chr>       <fct> <fct>   <dbl> <dbl> <lgl>   
#> 1 subject_001 Arm-A Study-X     1  23.4 TRUE    
#> 2 subject_001 Arm-A Study-X    50 108.  FALSE   
#> 3 subject_001 Arm-A Study-X   100 137.  FALSE   
#> 4 subject_001 Arm-A Study-X   150 141.  FALSE   
#> 5 subject_001 Arm-A Study-X   200 220.  FALSE   
#> 6 subject_001 Arm-A Study-X   250 325.  FALSE

Finally, we wrap these in the data formatting functions. Here the mapping of the column names to the required variables happens. This means that in our applications we don’t have to use the same variable names as seen above, but we can use custom names and then apply the mapping here.

joint_data <- DataJoint(
    subject = DataSubject(
        data = os_data,
        subject = "subject",
        arm = "arm",
        study = "study"
    ),
    survival = DataSurvival(
        data = os_data,
        formula = Surv(time, event) ~ cov_cat + cov_cont
    ),
    longitudinal = DataLongitudinal(
        data = long_data,
        formula = sld ~ time,
        threshold = 5
    )
)

Now let’s have a look how we can fit the (joint) models.

Debugging Stan Code

It is always possible to read out the Stan code that is contained in the JointModel object, using write_stan():

tmp <- tempfile()
write_stan(simple_model, destination = tmp)
first_part <- head(readLines(tmp), 10)
cat(paste(first_part, collapse = "\n"))
#> functions {
#>     //
#>     // Source - base/functions.stan
#>     //
#> 
#>     // Constant used in below.
#>     real neg_log_sqrt_2_pi() {
#>         return -0.9189385332046727;
#>     }

Sampling Parameters

Finally, sampleStanModel() is kicking off the MCMC sampler via cmdstanr in the backend. Note that in practice you would increase the number of warm-up and sampling iterations.

mcmc_results <- sampleStanModel(
    simple_model,
    data = joint_data,
    iter_sampling = 500,
    iter_warmup = 500,
    chains = 1,
    parallel_chains = 1
)
#> Running MCMC with 1 chain...
#> 
#> Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup)
#> Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
#> Chain 1 Exception: gamma_lpdf: Random variable is inf, but must be positive finite! (in '/tmp/Rtmpb6WNy1/model-19a1625d1533.stan', line 507, column 4 to column 100)
#> Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
#> Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
#> Chain 1
#> Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 1 finished in 24.9 seconds.

Convergence checks

After the sampling finishes, we can inspect the parameter distributions. This is using the cmdstanr functions, because the results element is of class CmdStanMCMC.

vars <- c(
    "lm_rs_intercept",
    "lm_rs_slope_mu",
    "lm_rs_slope_sigma",
    "lm_rs_sigma",
    "link_dsld",
    "sm_weibull_ph_lambda",
    "sm_weibull_ph_gamma",
    "beta_os_cov"
)
as.CmdStanMCMC(mcmc_results)$summary(vars)
#> # A tibble: 11 × 10
#>    variable        mean   median      sd     mad       q5     q95  rhat ess_bulk
#>    <chr>          <dbl>    <dbl>   <dbl>   <dbl>    <dbl>   <dbl> <dbl>    <dbl>
#>  1 lm_rs_inte…  2.77e+1 27.7     1.29    1.31    25.4     29.7    1.00      254.
#>  2 lm_rs_slop…  1.02e+0  1.02    0.0296  0.0285   0.972    1.07   0.998     775.
#>  3 lm_rs_slop…  1.96e+0  1.96    0.0296  0.0303   1.91     2.01   1.00      938.
#>  4 lm_rs_slop…  2.04e-1  0.203   0.0148  0.0158   0.181    0.229  1.00      938.
#>  5 lm_rs_sigma  1.92e+1 19.2     0.543   0.570   18.4     20.1    0.999     799.
#>  6 link_dsld   -5.73e-2 -0.0472  0.200   0.188   -0.391    0.277  1.00      539.
#>  7 sm_weibull…  1.00e-2  0.00863 0.00562 0.00440  0.00371  0.0204 1.00      372.
#>  8 sm_weibull…  8.85e-1  0.882   0.0711  0.0755   0.770    1.00   1.00      689.
#>  9 beta_os_co… -1.53e-5 -0.00640 0.239   0.233   -0.372    0.399  1.00      764.
#> 10 beta_os_co…  3.34e-1  0.333   0.261   0.264   -0.0937   0.750  1.00      866.
#> 11 beta_os_co…  3.65e-1  0.370   0.104   0.0972   0.190    0.531  1.00     1167.
#> # ℹ 1 more variable: ess_tail <dbl>

We can already see here from the rhat statistic whether the MCMC sampler converged - values close to 1 indicate convergence, while values larger than 1 indicate divergence.

In general, convergence is sensitive to the choice of:

Priors
Initial values
Sufficient warm-up iterations

If you don’t achieve convergence, then play around with different choices of the above.

Plotting

We can now proceed towards investigating the results of the MCMC chain with plots.

Trace plots

Using the draws() method on the results element, we can extract the samples in a format that is understood by bayesplot. We can e.g. look at some simple trace plots:

vars_draws <- as.CmdStanMCMC(mcmc_results)$draws(vars)

library(bayesplot)
#> This is bayesplot version 1.11.1
#> - Online documentation and vignettes at mc-stan.org/bayesplot
#> - bayesplot theme set to bayesplot::theme_default()
#>    * Does _not_ affect other ggplot2 plots
#>    * See ?bayesplot_theme_set for details on theme setting

mcmc_trace(vars_draws)

Longitudinal fit plots

Using the longitudinal() method we can extract the longitudinal fit samples from the result, and then plot them for all subjects or those that we are interested in. For illustration, we will plot here the first 10 subjects:

selected_subjects <- head(os_data$subject, 10)
long_quantities <- LongitudinalQuantities(
    mcmc_results,
    grid = GridFixed(
        subjects = selected_subjects
    )
)
as.data.frame(long_quantities) |> head()
#>         group time  values
#> 1 subject_001    0 26.0238
#> 2 subject_001    0 27.5690
#> 3 subject_001    0 26.4398
#> 4 subject_001    0 26.6774
#> 5 subject_001    0 28.1050
#> 6 subject_001    0 26.0869

summary(long_quantities) |> head()
#>         group time   median    lower    upper
#> 1 subject_001    0 27.69405 25.14073 30.12683
#> 2 subject_002    0 27.69405 25.14073 30.12683
#> 3 subject_003    0 27.69405 25.14073 30.12683
#> 4 subject_004    0 27.69405 25.14073 30.12683
#> 5 subject_005    0 27.69405 25.14073 30.12683
#> 6 subject_006    0 27.69405 25.14073 30.12683

autoplot(long_quantities)

Survival fit plots

And using the SurvivalQuantities() method we can do the same for the estimated survival functions.

surv_quantities <- SurvivalQuantities(
    mcmc_results,
    grid = GridFixed(
        subjects = selected_subjects
    )
)
as.data.frame(surv_quantities) |> head()
#>         group time values
#> 1 subject_001    0      1
#> 2 subject_001    0      1
#> 3 subject_001    0      1
#> 4 subject_001    0      1
#> 5 subject_001    0      1
#> 6 subject_001    0      1

summary(surv_quantities) |> head()
#>         group time median lower upper
#> 1 subject_001    0      1     1     1
#> 2 subject_002    0      1     1     1
#> 3 subject_003    0      1     1     1
#> 4 subject_004    0      1     1     1
#> 5 subject_005    0      1     1     1
#> 6 subject_006    0      1     1     1

autoplot(surv_quantities)

We can also aggregate the estimated survival curves from groups of subjects, using the corresponding method.

surv_quantities <- SurvivalQuantities(
    mcmc_results,
    grid = GridGrouped(
        groups = split(os_data$subject, os_data$arm)
    )
)
autoplot(surv_quantities, add_km = TRUE)

Predicting Survival Quantities for Hypothetical Subjects

The SurvivalQuantities() method can also be used to predict survival quantities for hypothetical subjects with any arbitrary covariate and longitudinal model parameter values. This is done via passing a GridPrediction() object to the method where the newdata and params arguments are used to specify the desired values for the covariates and longitudinal model parameters respectively. For example below we use these functions to predict the survival distribution for a hypothetical subject:

surv_quantities <- SurvivalQuantities(
    mcmc_results,
    grid = GridPrediction(
        times = seq(1, 800, by = 50),
        newdata = dplyr::tibble(
            cov_cat = c("A", "B"),
            cov_cont = c(1.2, 2)
        ),
        params = list(
            intercept = 40,
            slope = 0.05
        )
    )
)
autoplot(surv_quantities)

The exact names that are required for the params argument (i.e. the longitudinal model parameters) can be found by running getPredictionNames() on the longitudinal model object. For example:

getPredictionNames(LongitudinalGSF())
#> [1] "b"   "s"   "g"   "phi"

Brier Score

The brierScore() method can be used to extract the Brier Scores (predictive performance measure) from our SurvivalQuantities object.

sq <- SurvivalQuantities(
    mcmc_results,
    grid = GridFixed(times = c(1, 50, 100, 400, 800)),
    type = "surv"
)
brierScore(sq)
#>          1         50        100        400        800 
#> 0.01951390 0.17167799 0.23677714 0.14894734 0.06115057

Initial Values

By default jmpost will set the initial values for all parameters to be a random value drawn from the prior distribution that has been shrunk towards the mean of said distribution e.g. for a prior_normal(4, 2) the initial value for each chain will be:

4 * shrinkage_factor + rnorm(1, 4, 2) * (1 - shrinkage_factor)

Note that the shrinkage factor is set to 0.5 by default and can be changed via the jmpost.prior_shrinkage option e.g.

options("jmpost.prior_shrinkage" = 0.7)

If you wish to manually specify the initial values you can do so via the initialValues() function. For example:

joint_model <- JointModel(
    longitudinal = LongitudinalRandomSlope(),
    survival = SurvivalExponential(),
    link = linkNone()
)
initial_values <- initialValues(joint_model, n_chains = 2)

initial_values[[1]]$lm_rs_intercept <- 0.2
initial_values[[2]]$lm_rs_intercept <- 0.3

mcmc_results <- sampleStanModel(
    joint_model,
    data = DataJoint(...),
    init = initial_values,
    chains = 2
)

Note the following: - initialValues() will return a list of lists where each sublist contains the initial values for the corresponding chain index. - initialValues() by default just returns 1 initial value for each parameter; the sampleStanModel() function will then broadcast this number as many times as required e.g. if you have 3 covariates in your survival model then the initial value for the Beta coeficient will be repeated 3 times. If however you want to specify individual initial values for each covariate you can do so by passing in a vector of the same length as the number of covariates. - initialValues() will not check if the proposed values are valid for constrained parameters. That is, if you are using a prior_cauchy(0, 1) prior for a parameter that should be >0 then you will need to manually set the initial value (as described above) to ensure that it is a valid value. - For constrained parameters (e.g. variance parameters that must be \(> 0\)) initialValues() will continuously sample and discard initial values until it generates one that meet the constraints. If after 100 attempts no valid initial value has been found then it will throw an error.