Automatic identification of variables in epidemiological datasets using logic regression
Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L.; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L.; Kiechl, Stefan; Orth, Andreas; Norata, Giuseppe D.; Empana, Jean Philippe; Lin, Hung Ju; McLachlan, Stela; Bokemark, Lena; Ronkainen, Kimmo; Amato, Mauro; Schminke, Ulf; Srinivasan, Sathanur R.; Lind, Lars; Kato, Akihiko; Dimitriadis, Chrystosomos; Przewlocki, Tadeusz; Okazaki, Shuhei; Stehouwer, C. D.A.; Lazarevic, Tatjana; Willeit, Peter; Yanez, David N.; Steinmetz, Helmuth; Sander, Dirk; Poppert, Holger; Desvarieux, Moise; Ikram, M. Arfan; Bevc, Sebastjan; Staub, Daniel; Sirtori, Cesare R.; Iglseder, Bernhard; Engström, Gunnar; Tripepi, Giovanni; Beloqui, Oscar; Lee, Moo Sik; Friera, Alfonsa; Xie, Wuxiang; Grigore, Liliana; Plichart, Matthieu; Su, Ta Chen; Robertson, Christine; Schmidt, Caroline; Tuomainen, Tomi Pekka; Veglia, Fabrizio; Völzke, Henry; Nijpels, Giel; Jovanovic, Aleksandar; Willeit, Johann; Sacco, Ralph L.; Franco, Oscar H.; Hojs, Radovan; Uthoff, Heiko; Hedblad, Bo; Park, Hyun Woong; Suarez, Carmen; Zhao, Dong; Catapano, Alberico; Ducimetiere, Pierre; Chien, Kuo Liong; Price, Jackie F.; Bergström, Göran; Kauhanen, Jussi; Tremoli, Elena; Dörr, Marcus; Berenson, Gerald; Papagianni, Aikaterini; Kablak-Ziembicka, Anna; Kitagawa, Kazuo; Dekker, Jaqueline M.; Stolic, Radojica; Polak, Joseph F.; Sitzer, Matthias; Bickel, Horst; Rundek, Tatjana; Hofman, Albert; Ekart, Robert; Frauchiger, Beat; Castelnuovo, Samuela; Rosvall, Maria; Zoccali, Carmine; Landecho, Manuel F.; Bae, Jang Ho; Gabriel, Rafael; Liu, Jing; Baldassarre, Damiano; Kavousi, Maryam
(2017) BMC medical informatics and decision making [E], volume 17, issue 1
(Article)
Abstract
Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software that presents, for each target variable, a choice of source variables from which a user can select the matching one, with only a low risk of having missed a correct source variable.
Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated.
Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables, in the validation sample in 71% of all variables.
Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
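To make the Methods step concrete, the following is a minimal sketch of how a Boolean combination of simple rules can be scored by PPV and NPV for one target variable. It is not the authors' implementation and does not use logic regression itself: the source variables, the three rules, and the exhaustive search over rule pairs are all hypothetical stand-ins (the logic regression package searches a far larger space of Boolean trees). Ranking candidates by NPV first and PPV second is likewise an assumption made for illustration.

```python
from itertools import combinations

# Hypothetical source variables for the target variable "age":
# (variable name, variable label, whether it truly matches the target).
SOURCE_VARS = [
    ("age", "Age at baseline (years)", True),
    ("AGE_BL", "age in years", True),
    ("agegrp", "Age group (categorical)", False),
    ("sex", "Sex of participant", False),
    ("sbp", "Systolic blood pressure", False),
    ("dosage", "Drug dosage (mg)", False),
]

# Hypothetical hand-written simple rules; each maps (name, label) to a bool.
RULES = {
    "name_has_age": lambda name, label: "age" in name.lower(),
    "label_has_years": lambda name, label: "year" in label.lower(),
    "label_has_group": lambda name, label: "group" in label.lower(),
}


def ppv_npv(predict):
    """Return (PPV, NPV) of a predicate over SOURCE_VARS.

    An undefined value (no predicted positives or negatives) is reported as 0.0.
    """
    tp = fp = tn = fn = 0
    for name, label, truth in SOURCE_VARS:
        predicted = predict(name, label)
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    npv = tn / (tn + fn) if (tn + fn) else 0.0
    return ppv, npv


def best_pair():
    """Exhaustively try 'a AND b' and 'a OR b' for every pair of rules.

    Returns ((npv, ppv), description) of the highest-ranked combination.
    """
    operators = (
        ("AND", lambda x, y: x and y),
        ("OR", lambda x, y: x or y),
    )
    best = None
    for a, b in combinations(RULES, 2):
        for op_name, op in operators:
            # Bind a, b, op as defaults so each lambda keeps its own values.
            predict = lambda n, l, a=a, b=b, op=op: op(RULES[a](n, l), RULES[b](n, l))
            ppv, npv = ppv_npv(predict)
            score = (npv, ppv)
            if best is None or score > best[0]:
                best = (score, f"{a} {op_name} {b}")
    return best


if __name__ == "__main__":
    print("single rule:", ppv_npv(RULES["name_has_age"]))
    print("best pair:", best_pair())
```

In this toy data the single rule `name_has_age` also fires on `agegrp` and `dosage`, which lowers its PPV; combining it with a second rule removes those false positives. Real variable catalogues are much noisier, which is consistent with the low PPV reported in the abstract.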
Keywords: Data management, Epidemiology, Logic regression, Meta-analysis, Health Policy, Health Informatics, Journal Article
ISSN: 1472-6947
Publisher: BioMed Central
Note: Funding Information: We thank Ingo Ruczinski, Charles Kooperberg, and Michael LeBlanc at the Fred Hutchinson Cancer Research Center in Seattle for providing the public license CRAN software package, and the related documentation. This manuscript was prepared using a limited access dataset of the Atherosclerosis Risk In Communities (ARIC) study, obtained from the National Heart, Lung and Blood Institute (NHLBI). The ARIC study is conducted and supported by NHLBI in collaboration with the ARIC Study investigators. This manuscript does not necessarily reflect the opinions or views of the ARIC study or the NHLBI. The Bruneck study was supported by the Pustertaler Verein zur Praevention von Herz-und Hirngefaesserkrankungen, Gesundheitsbezirk Bruneck, and the Assessorat fuer Gesundheit, Province of Bolzano, Italy. The Carotid Atherosclerosis Progression Study (CAPS) was supported by the Stiftung Deutsche Schlaganfall-Hilfe. The PLIC Study is supported by a grant from SISA Sezione Regionale Lombarda. This manuscript was prepared using data from the Cardiovascular Health Study (CHS). The research reported in this article was supported by contracts N01-HC-85079 through N01-HC-85086, N01-HC-35129, N01 HC-15103, N01 HC-55222, and U01 HL080295 from the National Heart, Lung, and Blood Institute, with additional contribution from the National Institute of Neurological Disorders and Stroke. A full list of participating CHS investigators and institutions can be found at http://www.chs-nhlbi.org. The EVA Study was organized under an agreement between INSERM and the Merck, Sharp, and Dohme-Chibret Company. The Edinburgh Artery Study (EAS) was funded by the British Heart Foundation.
The IMPROVE study was supported by the European Commission (Contract number: QLG1-CT-2002-00896), Ministero della Salute Ricerca Corrente, Italy, the Swedish Heart-Lung Foundation, the Swedish Research Council (projects 8691 and 0593), the Foundation for Strategic Research, the Stockholm County Council (project 562183), the Foundation for Strategic Research, the Academy of Finland (Grant #110413) and the British Heart Foundation (RG2008/014). The INVADE study was supported by the AOK Bayern. This manuscript was prepared using data from the Northern Manhattan Study (NOMAS) and the Oral Infections, Carotid Atherosclerosis and Stroke (INVEST) study. The NOMAS is funded by the National Institute of Neurological Disorders and Stroke (NINDS) grant R37 NS 029993 and INVEST by the National Institute of Dental and Craniofacial Research (NIDCR) grant R01 DE 13094. The Rotterdam Study was supported by the Netherlands Foundation for Scientific Research (NWO), ZonMw, Vici 918-76-619. The Study of Health in Pomerania (SHIP; http://ship.community-medicine.de) is part of the Community Medicine Research net (CMR) of the University of Greifswald, Germany. Collaborators within the PROG-IMT study group: Giuseppe D. Norata, PhD1,2, Jean Philippe Empana, MD, PhD3, Hung-Ju Lin, MD4, Stela McLachlan, PhD5, Lena Bokemark, MD, PhD6, Kimmo Ronkainen, MSc7, Mauro Amato, PhD8, Ulf Schminke, MD, Prof9, Sathanur R. Srinivasan, PhD, Prof.10, Lars Lind, MD, PhD, Prof11, Akihiko Kato, MD, Prof.12, Chrystosomos Dimitriadis, MD13, Tadeusz Przewlocki, MD, PhD, Prof.14, Shuhei Okazaki, MD15, CDA Stehouwer, MD, PhD, FESC16, Tatjana Lazarevic, MA17, Peter Willeit, PhD18,19, David N. Yanez, PhD, Assoc. Prof20, Helmuth Steinmetz, MD, Prof21, Dirk Sander, MD, Prof22, Holger Poppert, MD, PhD23, Moise Desvarieux, MD, PhD, Assoc. Prof.24, M. Arfan Ikram, MD, PhD, Assoc. Prof.25-27, Sebastjan Bevc, MD, PhD, Assist Prof28, Daniel Staub, MD, Prof.29, Cesare R. 
Sirtori, MD, PhD, Prof.30, Bernhard Iglseder, MD, Prof31,32, Gunnar Engström, MD, PhD, Prof. 33, Giovanni Tripepi, MSc34, Oscar Beloqui, MD, PhD35, Moo-Sik Lee, MD., PhD., Prof.36,37, Alfonsa Friera, MD38, Wuxiang Xie, MD, PhD, Assist. Prof.39, Liliana Grigore, MD40, Matthieu Plichart, MD, PhD41, Ta-Chen Su, MD, PhD, Assoc. Prof.4, Christine Robertson, MBChB5, Caroline Schmidt, PhD, Assoc. Prof.42, Tomi-Pekka Tuomainen, MD, PhD, Prof7, Fabrizio Veglia, PhD8, Henry Völzke, MD, Prof43,44, Giel Nijpels, MD, PhD45,46, Aleksandar Jovanovic, MD, PhD, Prof47, Johann Willeit, MD, Prof.18, Ralph L. Sacco, MD, MS, Prof.48, Oscar H. Franco, MD, PhD, FESC, FFPH, Prof. 49, Radovan Hojs, MD, PhD, Prof28,50, Heiko Uthoff, MD29, Bo Hedblad, MD, PhD, Prof33, Hyun Woong Park, M.D.36, Carmen Suarez, MD, PhD51, Dong Zhao, MD, PhD, Prof.39, Alberico Catapano, PhD, Prof.52,53, Pierre Ducimetiere, Prof.54, Kuo-Liong Chien, MD, Prof55, Jackie F. Price, MD5, Göran Bergström, MD, PhD, Prof56, Jussi Kauhanen, MD, Prof7, Elena Tremoli, PhD, Prof8,57, Marcus Dörr, MD, Prof.58, Gerald Berenson, MD, Prof.59, Aikaterini Papagianni, MD, Assoc. Prof.13, Anna Kablak-Ziembicka, MD, PhD, Prof.14, Kazuo Kitagawa, MD, PhD60, Jaqueline.M. Dekker, Prof61, Radojica Stolic, MD, PhD, Prof17, Stefan Kiechl, MD, Prof18, Joseph F. Polak, MD, MPH, Prof62, Matthias Sitzer, MD, Prof.63, Horst Bickel, PhD64, Tatjana Rundek, MD, PhD, Prof.48, Albert Hofman, MD, PhD, Prof.25, Robert Ekart, MD, PhD, Assist. Prof65, Beat Frauchiger, MD, Prof.66, Samuela Castelnuovo, PhD67, Maria Rosvall, MD, PhD, Assoc. Prof.68, Carmine Zoccali, MD, Prof.34, Manuel F Landecho, MD, PhD35, Jang-Ho Bae, MD.,PhD.,FACC.36,69, Rafael Gabriel, Prof., MD, Phd70, Jing Liu, MD, PhD, Prof.39, Damiano Baldassarre, PhD, Prof8, Maryam Kavousi, MD, PhD71. Funding Information: The PROG-IMT project was funded by the Deutsche Forschungsgemeinschaft (DFG Lo 1569/2-1 and DFG Lo 1569/2-3). Publisher Copyright: © 2017 The Author(s).
(Peer reviewed)