

Statistical issues in survival analysis (Nature article 3556)


April 10, 2024

The authors aimed to assess 3-year overall survival for colorectal cancer (CRC) and to identify the associated prognostic factors among patients in Morocco using a machine learning approach, the random survival forest (RSF). CRC is currently the third most common cancer. The authors highlighted that RSF can accommodate nonlinearities and interactions among variables and is not restricted by a baseline hazard assumption, as in Cox proportional hazards regression, or by the assumption that predictor variables act multiplicatively on the baseline hazard rate throughout the period of observation. The data were collected retrospectively between 2009 and 2015, with follow-up until death or right censoring at the end of the study.

In their analysis section, they admitted they could not conduct multiple imputation with machine learning, so they adopted a single imputation of missing data using the missRanger algorithm, which combines a random forest (RF) with predictive mean matching, a non-parametric imputation method that makes no prior assumptions about the distribution of the data. Missing values were predicted directly by an RF trained on the observed parts of the dataset. They then conducted a multiple imputation relying on random forest (mice RF) with 10 datasets and a single imputation based on random forest (missRanger). They first computed Kaplan-Meier estimates of survival and compared the curves using a log-rank test, whose power also depends on the proportional hazards assumption, but the authors did not discuss this issue and simply used the test as is. They also compared their RSF fits to Cox proportional hazards regression model fits.
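
To make the Kaplan-Meier and log-rank step concrete, here is a minimal pure-Python sketch. It is not the authors' code: the function names `kaplan_meier` and `logrank_test` are hypothetical, and the sketch assumes two groups, right censoring only, and events coded as 1 (death observed) or 0 (censored).

```python
import math
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time per patient
    events : 1 if death was observed, 0 if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # everyone who leaves the risk set at time t
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk  # product-limit update
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0
    var = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)               # at risk overall
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e)         # deaths at t
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # hypergeometric variance of the group A death count at time t
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        raise ValueError("no informative event times")
    chi2 = obs_minus_exp ** 2 / var
    # Upper tail of a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Note that the statistic gives every event time the same weight. That is why its power is tied to proportional hazards: when survival curves cross, early and late differences between the groups can cancel each other out.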


The authors also used permutation-based variable importance for covariate selection, which calculates the prediction error attributable to each predictor by comparing datasets with and without permuted values for that variable. They also produced partial dependence plots to explore the relationship between the estimated partial effect of a given predictor and survival rates. Finally, they assessed model discrimination with the concordance index (c-index) and predictive accuracy with the Brier score, which lies between 0 and 1.
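
The three quantities in this paragraph can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the paper's implementation: all function names are hypothetical, `risk_fn` stands in for any fitted model that maps a patient record to a risk score, and the Brier score shown here simply drops patients censored before the time horizon rather than applying the inverse-probability-of-censoring weights used in practice.

```python
import random


def harrell_c_index(times, events, risk_scores):
    """Share of comparable pairs in which the earlier death has the higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i died strictly before j's follow-up time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def brier_score_at(horizon, times, events, surv_probs):
    """Mean squared error between predicted P(T > horizon) and observed status.

    Simplification (assumption): patients censored before the horizon are
    skipped because their status at the horizon is unknown; no IPCW weights.
    """
    total, used = 0.0, 0
    for t, e, s in zip(times, events, surv_probs):
        if t > horizon:
            alive = 1.0
        elif e:
            alive = 0.0
        else:
            continue
        total += (alive - s) ** 2
        used += 1
    return total / used


def permutation_importance(risk_fn, rows, times, events, column,
                           n_repeats=20, seed=0):
    """Average drop in c-index after shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = harrell_c_index(times, events, [risk_fn(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{column: v}) for r, v in zip(rows, shuffled)]
        scores = [risk_fn(r) for r in permuted]
        drops.append(baseline - harrell_c_index(times, events, scores))
    return sum(drops) / len(drops)
```

A c-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, while a lower Brier score is better. A predictor the model ignores has a permutation importance of exactly zero, which is what makes the measure usable for covariate selection.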

They found that the results from their RSF corresponded to the Cox model results in terms of parameter significance levels. The c-index values and Brier scores were also similar for both methods, yet the authors claim that the RSF had better discriminative capacity and predictive accuracy. Furthermore, the Cox model is the only other survival analysis method against which they compared RSF; they did not test against others, yet came to the conclusion that RSF is much more flexible. They also admitted that their data never met the assumptions of the Cox model, which are a central tenet of its use. Finally, they did not try multiple imputation with the Cox model and then compare those results to the imputation they had done with the RSF. Clearly, a more rigorous comparison of these methods is warranted.
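
One way the missing comparison could be set up is to fit the Cox model on each imputed dataset and pool the coefficients with Rubin's rules. The sketch below shows only the pooling step; the function name `pool_rubin` is hypothetical, and the coefficient values in the usage example are made up for illustration and are not taken from the paper.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates : the coefficient estimated on each imputed dataset
    variances : its squared standard error on each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2:
        raise ValueError("need at least two imputed datasets")
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical log hazard ratios from two imputed datasets
estimate, std_error = pool_rubin([0.4, 0.6], [0.01, 0.01])
```

The between-imputation term is what a single imputation cannot capture: it inflates the standard error to reflect uncertainty about the missing values themselves.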

 

Written by,

Usha Govindarajulu, PhD

 

Keywords: survival analysis, Cox model, random survival forest, Brier score, c-index, multiple imputation

 

References

El Badisy, I., Ben Brahim, Z., Khalis, M. et al. Risk factors affecting patients survival with colorectal cancer in Morocco: survival analysis using an interpretable machine learning approach. Sci Rep 14, 3556 (2024). https://doi.org/10.1038/s41598-024-51304-3