Abstract
The identification of individual patients at risk of disease has become an integral part of recent trends towards a more personalized healthcare system, in which the most appropriate treatment can be offered to an individual patient given their risk profile. Clinical prediction models are prime candidates
to assist clinicians with accurate risk estimates. However, because prediction models require complete information to generate predictions, they are severely hampered whenever patient or disease characteristics are missing.
In this thesis we show that, so far, most clinical prediction model studies that make use of machine learning (ML) techniques report insufficiently on the presence and handling of missing data. Although ill-advised, deletion is the most frequently used method. Similarly, we show that the adherence of ML prediction model studies to currently recommended reporting guidelines is also poor.
We present the development of several methods for imputing missing predictor values in real time. We compared the accuracy of two common imputation methods with real-time capabilities: conditional modeling imputation (CMI, where a separate multivariable imputation model is derived for each predictor) and joint modeling imputation (JMI, where all predictors are assumed to be normally distributed and the observed patient information is used to generate imputations for each missing predictor). We then compared both methods with mean imputation. Simulations showed that both JMI and CMI can be recommended in terms of imputation accuracy (i.e., root mean squared error, RMSE). As JMI was faster and less complex, it was further evaluated using common measures of predictive performance (i.e., discrimination and calibration) and with respect to the use of auxiliary variables. In summary, JMI is most beneficial when estimated in local data and with the use of auxiliary variables. Its added value is most prominent whenever the missing predictors are correlated with other observed (auxiliary) variables.
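To illustrate the idea behind JMI, the following is a minimal single-imputation sketch (not the thesis's implementation): assuming the predictors follow a multivariate normal distribution with known mean and covariance, a missing value is replaced by its conditional mean given the observed values. The function name `jmi_impute` and the toy parameters are hypothetical.

```python
import numpy as np

def jmi_impute(x, mu, sigma):
    """Joint modeling imputation (JMI) sketch: replace missing entries
    (NaN) in x with the conditional mean of a multivariate normal with
    mean mu and covariance sigma, given the observed entries. Multiple
    imputation would instead draw from the conditional distribution."""
    x = np.asarray(x, dtype=float)
    m = np.isnan(x)            # mask of missing predictors
    if not m.any():
        return x
    o = ~m                     # mask of observed predictors
    # Conditional mean: mu_m + S_mo S_oo^{-1} (x_o - mu_o)
    s_oo = sigma[np.ix_(o, o)]
    s_mo = sigma[np.ix_(m, o)]
    x_imp = x.copy()
    x_imp[m] = mu[m] + s_mo @ np.linalg.solve(s_oo, x[o] - mu[o])
    return x_imp

# Toy example: three correlated predictors, the third is missing.
mu = np.array([0.0, 0.0, 0.0])
sigma = np.array([[1.0, 0.5, 0.8],
                  [0.5, 1.0, 0.3],
                  [0.8, 0.3, 1.0]])
print(jmi_impute(np.array([1.0, -0.5, np.nan]), mu, sigma))
```

Because the imputation is a closed-form function of the observed values, it can be evaluated in real time at the moment a prediction is requested, which is what distinguishes JMI from iterative imputation routines that require the full dataset.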
To compare these imputation methods with possible “built-in” designs, we evaluated multiple missing data handling methods against JMI. Specifically, we evaluated pattern submodels (PS, where a separate prediction model is developed for each pattern of missing variables) and surrogate splits (SS, where an optimal replacement for the missing predictor is sought among the available patient information). Provided multiple imputations are used, JMI is still to be preferred over PS and SS.
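The pattern-submodel idea can be sketched as follows (a simplified illustration, not the models evaluated in the thesis): training rows are grouped by their missingness pattern, and one least-squares model is fitted per pattern using only the predictors observed in that pattern. The helper names `fit_pattern_submodels` and `predict_ps` are hypothetical.

```python
import numpy as np

def fit_pattern_submodels(X, y):
    """Pattern submodels (PS) sketch: fit one least-squares model per
    missingness pattern, using only the columns observed in that
    pattern. Assumes every pattern occurs in the training data."""
    models = {}
    patterns = np.isnan(X)
    for pat in np.unique(patterns, axis=0):
        rows = (patterns == pat).all(axis=1)
        obs = ~pat                          # observed columns for this pattern
        A = np.column_stack([np.ones(rows.sum()), X[rows][:, obs]])
        coef, *_ = np.linalg.lstsq(A, y[rows], rcond=None)
        models[tuple(pat)] = (obs, coef)
    return models

def predict_ps(models, x):
    """Predict with the submodel matching x's missingness pattern."""
    obs, coef = models[tuple(np.isnan(x))]
    return coef[0] + x[obs] @ coef[1:]

# Toy data: y = x0 + 2*x1; the second predictor is missing in some rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 2 * X[:, 1]
X[:50, 1] = np.nan
models = fit_pattern_submodels(X, y)
print(predict_ps(models, np.array([1.0, 1.0])))  # uses the complete-case model
```

A practical drawback, visible even in this sketch, is that each submodel is estimated on only the subset of patients sharing that pattern, which fragments the training data as the number of patterns grows.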
Still, prediction models need to generalize to clinical data. We show that, when the data are clustered, internal-external cross-validation (IECV) is to be preferred for assessing the generalizability of a prediction model during development. Additionally, we found that the accuracy of prediction models does not necessarily benefit from more complex modeling strategies, which suggests that IECV is also useful for reducing model complexity.
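The IECV procedure can be sketched as a leave-one-cluster-out loop (a simplified illustration, assuming clusters such as hospitals; here a linear model and RMSE stand in for the discrimination and calibration measures used in practice). The function name `iecv` is hypothetical.

```python
import numpy as np

def iecv(X, y, cluster):
    """Internal-external cross-validation (IECV) sketch: each cluster
    is held out once, the model is fitted on the remaining clusters,
    and performance is assessed on the held-out cluster. Returns a
    per-cluster RMSE, giving a direct view of generalizability."""
    results = {}
    for c in np.unique(cluster):
        train, test = cluster != c, cluster == c
        A = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = coef[0] + X[test] @ coef[1:]
        results[c] = float(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return results

# Toy example: three clusters sharing the same predictor effects.
rng = np.random.default_rng(1)
cluster = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)
print(iecv(X, y, cluster))  # per-cluster performance, roughly equal here
```

Unlike ordinary cross-validation with random splits, holding out whole clusters mimics deployment in a new setting, which is why heterogeneity across the per-cluster results signals limited generalizability.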
Yet, it remains largely uncertain whether personalized medicine, in the form of clinical decision support systems (CDSS), will offer the benefits it appears to promise. First and foremost, and for fair comparison, the severe consequences of improper missing data handling must be appreciated, and missing data must be handled appropriately.