Abstract
The continuous generation and collection of routine care data in EHR (Electronic Health Record) databases provides a wealth of information, known as (medical) big data, that can be reused to generate medical knowledge and improve healthcare. However, data alone is useless; simple analytical tools are no longer sufficient to extract
valuable information due to the complex and high-dimensional nature of the data. Therefore, researchers have begun applying data science techniques to routine care data. Data science, a broad field that intersects with statistics, offers a range of methods for performing computations on large datasets, including data processing and analysis. Artificial Intelligence (AI), particularly machine learning (ML), enables algorithms to identify patterns in the data, learn rules, and make predictions, contributing to tasks such as label prediction and prognoses. Inspired by successes in other data sectors, many believed that data science would revolutionise the clinical world.

This thesis aims to explore the challenges encountered in applying data science to tabular routine care data and to clarify some of the reasons why the promises have (so far) not been fulfilled. Through multidisciplinary collaboration with healthcare professionals from the University Medical Center Utrecht (UMCU), research questions and methodologies were defined to develop clinical decision-support tools, particularly for patients with the highest severity levels. As a result, data from both the Intensive Care Unit (ICU) and the Emergency Department (ED) were closely examined.

Part I of the thesis focuses on the application of data science in the medical field, particularly examining medical data and labels, and the variations between different centers. It highlights the potential biases and hidden patterns introduced during the process of patient data collection and storage in EHR databases, which can impact study results if not properly accounted for. High-quality data and accurate labels are crucial for effective research and model training. A comparison of labels for suspected sepsis patients showed significant differences in identifying positive cases between claims-based and AEC-defined methods.
Additionally, the study explored the challenges of transporting a machine learning model for early sepsis detection from the USA’s Beth Israel Deaconess Medical Center to UMC Utrecht, revealing significant differences in patient data, predictors, and clinical outcomes between the two centers.

Part II of the thesis explores three data science applications using EHR data. It highlights how varying criteria and definitions for ‘baseline’ can significantly impact the calculated prevalence of acute kidney injury (AKI) in the emergency department, revealing the complexity of applying guidelines to routine care data. Additionally, the study questions the assumption that creatinine levels in patients' plasma are stable, challenging the reliability of the CKD-EPI formula used for estimating glomerular filtration rates. The effect of an AKI e-alert on physician behavior was also analyzed, showing increased clinical awareness and adjustments in care.

Part III demonstrates the value of machine learning (ML) on EHR data, using high-quality labels to improve sepsis diagnosis and differentiate symptoms in immune-checkpoint inhibitor (ICI) patients, showing potential to enhance patient care and treatment plans.