Automatic identification of variables in epidemiological datasets using logic regression
Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L.; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L.; Kiechl, Stefan; Orth, Andreas; Norata, Giuseppe D.; Empana, Jean Philippe; Lin, Hung Ju; McLachlan, Stela; Bokemark, Lena; Ronkainen, Kimmo; Amato, Mauro; Schminke, Ulf; Srinivasan, Sathanur R.; Lind, Lars; Kato, Akihiko; Dimitriadis, Chrystosomos; Przewlocki, Tadeusz; Okazaki, Shuhei; Stehouwer, C. D.A.; Lazarevic, Tatjana; Willeit, Peter; Yanez, David N.; Steinmetz, Helmuth; Sander, Dirk; Poppert, Holger; Desvarieux, Moise; Ikram, M. Arfan; Bevc, Sebastjan; Staub, Daniel; Sirtori, Cesare R.; Iglseder, Bernhard; Engström, Gunnar; Tripepi, Giovanni; Beloqui, Oscar; Lee, Moo Sik; Friera, Alfonsa; Xie, Wuxiang; Grigore, Liliana; Plichart, Matthieu; Su, Ta Chen; Robertson, Christine; Schmidt, Caroline; Tuomainen, Tomi Pekka; Veglia, Fabrizio; Völzke, Henry; Nijpels, Giel; Jovanovic, Aleksandar; Willeit, Johann; Sacco, Ralph L.; Franco, Oscar H.; Hojs, Radovan; Uthoff, Heiko; Hedblad, Bo; Park, Hyun Woong; Suarez, Carmen; Zhao, Dong; Catapano, Alberico; Ducimetiere, Pierre; Chien, Kuo Liong; Price, Jackie F.; Bergström, Göran; Kauhanen, Jussi; Tremoli, Elena; Dörr, Marcus; Berenson, Gerald; Papagianni, Aikaterini; Kablak-Ziembicka, Anna; Kitagawa, Kazuo; Dekker, Jaqueline M.; Stolic, Radojica; Polak, Joseph F.; Sitzer, Matthias; Bickel, Horst; Rundek, Tatjana; Hofman, Albert; Ekart, Robert; Frauchiger, Beat; Castelnuovo, Samuela; Rosvall, Maria; Zoccali, Carmine; Landecho, Manuel F.; Bae, Jang Ho; Gabriel, Rafael; Liu, Jing; Baldassarre, Damiano; Kavousi, Maryam
(2017) BMC Medical Informatics and Decision Making [E], volume 17, issue 1
(Article)
Abstract
Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows the creation of software that, for each target variable, presents a choice of source variables from which a user can select the matching one, with only a low risk of having missed a correct source variable. Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables, in the validation sample in 71% of all variables. Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
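The Methods above describe combining simple, manually written rules into one Boolean expression per target variable and then judging that expression by PPV and NPV. The following is a minimal Python sketch of that evaluation step only. The rules, the target variable (systolic blood pressure), the source variable names, and the Boolean combination are all hypothetical illustrations, not taken from the article, and the combination is fixed by hand here rather than searched for by logic regression.

```python
from typing import Callable, List, Tuple

Rule = Callable[[str], bool]

# Hypothetical simple rules on a source variable name (not from the article).
def has_sys(name: str) -> bool:
    return "sys" in name.lower()

def has_pressure(name: str) -> bool:
    lowered = name.lower()
    return "bp" in lowered or "press" in lowered

def has_dia(name: str) -> bool:
    return "dia" in name.lower()

# One hand-picked Boolean combination of the rules for the hypothetical
# target "systolic blood pressure". Logic regression would search for such
# a combination; here it is simply assumed.
def matches_target(name: str) -> bool:
    return has_sys(name) and has_pressure(name) and not has_dia(name)

def ppv_npv(classifier: Rule, labelled: List[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (PPV, NPV) of a classifier on (name, is_true_match) pairs."""
    tp = fp = tn = fn = 0
    for name, is_match in labelled:
        predicted = classifier(name)
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
        else:
            tn += 1
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv

# Made-up labelled source variable names for illustration only.
examples = [
    ("SYSBP", True),
    ("sys_press_1", True),
    ("sbp", True),           # a true match that the rules miss
    ("DIABP", False),
    ("systemic_dis", False),
    ("age", False),
]

ppv, npv = ppv_npv(matches_target, examples)
print(f"PPV={ppv:.2f} NPV={npv:.2f}")
```

In this toy data the combination misses the abbreviated name "sbp", which lowers NPV. For the semi-automated workflow described in the Background, such misses matter more than false positives, since a user can still discard wrong candidates from a presented list.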
Keywords: Data management, Epidemiology, Logic regression, Meta-analysis, Health Policy, Health Informatics, Journal Article
ISSN: 1472-6947
Publisher: BioMed Central
(Publisher version, Peer reviewed)