Revealing Subgroups That Differ in Common and Distinctive Variation in Multi-Block Data: Clusterwise Sparse Simultaneous Component Analysis
Yuan, S.; de Roover, K.; Dufner, M.; Denissen, J.J.A.; van Deun, K.
(2021) Social Science Computer Review, volume 39, issue 5, pp. 802 - 820
(Article)
Abstract
Social and behavioral studies more and more often yield multi-block data, which consist of novel blocks of data (e.g., data from wearable devices) and traditional blocks of data (e.g., survey data) collected from the same sample. Multi-block data offer researchers valuable insights into complex social mechanisms, where several influences act together. Yet such mechanisms are likely to differ among subgroups. Hence, fully revealing the composite mechanisms underlying multi-block data is challenging, since proper clustering analysis of such data requires methods that simultaneously detect the covariation of variables underlying all data blocks and the group differences therein. Additionally, the methods should be able to handle high-dimensional datasets, which might include many irrelevant variables. Here, we present Clusterwise Sparse Simultaneous Component Analysis (CSSCA), a method that groups the subjects that are driven by the same mechanisms and, at the same time, extracts cluster-specific components that model these mechanisms. By imposing structure constraints, CSSCA further distinguishes common mechanisms that underlie all data blocks from distinctive mechanisms that only underlie one or a few data blocks. In extensive simulations, CSSCA delivered convincing results in recovering the clusters and their associated component structures across various conditions. More importantly, CSSCA showed a clear advantage over existing methods when substantial cluster differences in the component structure were present. We demonstrated the usefulness of CSSCA in an application to data stemming from a study on personality.
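Illustration (not part of the article): the idea described in the abstract, grouping subjects while extracting sparse cluster-specific components from concatenated data blocks, can be sketched as a simple alternating procedure. The Python code below is a minimal sketch under stated assumptions, not the authors' algorithm: the function names (sparse_loadings, residual, clusterwise_sparse_sca) and the parameters n_keep, n_iter and seed are hypothetical, sparsity is obtained by plain hard thresholding of SVD loadings, the data are assumed to be centered already, and the structure constraints that separate common from distinctive components are not modeled. The published implementation is the R package ClusterSSCA referred to in this record.

```python
import numpy as np


def sparse_loadings(X, n_comp, n_keep):
    """SVD loadings of X; keep only the n_keep largest entries per component.

    Hard thresholding is an assumption of this sketch (n_keep >= 1).
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    P = vt[:n_comp].T.copy()                      # variables x components
    for q in range(P.shape[1]):
        small = np.argsort(np.abs(P[:, q]))[:-n_keep]
        P[small, q] = 0.0                         # zero out the small loadings
        norm = np.linalg.norm(P[:, q])
        if norm > 0:
            P[:, q] /= norm                       # rescale to unit length
    return P


def residual(X, P):
    """Squared reconstruction error of each row of X given loadings P."""
    T, *_ = np.linalg.lstsq(P, X.T, rcond=None)   # components x subjects
    return ((X - T.T @ P.T) ** 2).sum(axis=1)


def clusterwise_sparse_sca(blocks, n_clusters, n_comp, n_keep,
                           n_iter=20, seed=0):
    """Alternate between refitting sparse loadings and reassigning subjects."""
    X = np.hstack(blocks)                         # subjects x all variables
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=X.shape[0])
    # Start every cluster from the loadings of the pooled data.
    loadings = [sparse_loadings(X, n_comp, n_keep) for _ in range(n_clusters)]
    for _ in range(n_iter):
        for k in range(n_clusters):
            members = X[labels == k]
            if members.shape[0] >= n_comp:        # skip clusters too small to refit
                loadings[k] = sparse_loadings(members, n_comp, n_keep)
        # Assign each subject to the cluster that reconstructs it best.
        errors = np.column_stack([residual(X, P) for P in loadings])
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                 # partition is stable
        labels = new_labels
    return labels, loadings


# Toy usage with two random blocks measured on the same 40 subjects.
demo_rng = np.random.default_rng(1)
survey = demo_rng.normal(size=(40, 5))            # hypothetical survey block
wearable = demo_rng.normal(size=(40, 8))          # hypothetical wearable block
labels, loadings = clusterwise_sparse_sca(
    [survey, wearable], n_clusters=2, n_comp=2, n_keep=4)
```

Like k-means, an alternating scheme of this kind depends on the random start, so in practice it would be rerun from several starting partitions and the solution with the smallest total residual retained.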
Keywords: clustering, data integration, high-dimensional data analysis, Social Sciences(all), Computer Science Applications, Library and Information Sciences, Law
ISSN: 0894-4393
Publisher: SAGE Publications Inc.
Note: Shuai Yuan is a PhD student working at the Department of Methodology and Statistics, Tilburg University. His doctoral project aims to develop new big data analytical methods for social and behavioral sciences.
Kim De Roover works as an assistant professor at the Department of Methodology and Statistics, Tilburg University. In her research, she combines component or factor analysis with clustering techniques to obtain hybrid methods for capturing heterogeneity in underlying covariance structure or measurement models of variables. She can be reached at k.deroover@uvt.nl.
Michael Dufner is a personality psychologist working at Medical School Berlin. His research examines topics such as self-perception, implicit personality, and social relations. He can be reached at dufnermi@googlemail.com.
Jaap J. A. Denissen works as a full professor at the Department of Developmental Psychology of Tilburg University. His broad research interests lie in various areas of personality psychology. He can be reached at jjadenissen@gmail.com.
Katrijn Van Deun works as an associate professor at the Department of Methodology and Statistics, Tilburg University. Her research focuses on the development of novel methods for exploration and prediction with high-dimensional multi-block data. She can be reached at k.vandeun@uvt.nl.
Affiliations: 1 Tilburg University, Tilburg, The Netherlands; 2 University of Leipzig, Germany.
Shuai Yuan, Tilburg University, Warandelaan 2, Tilburg, The Netherlands. Email: s.yuan@uvt.nl
This article is part of the SSCR special issue on “Big Data in the Behavioral and Social Sciences”, guest edited by Michael Bosnjak (Leibniz Institute for Psychology Information, Trier, Germany).
Article identifier: 0894439319888449. © The Author(s) 2019. SAGE Publications.
License: This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (http://www.creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
Authors' Note: The authors thank the editor and the anonymous reviewers for providing helpful comments on earlier drafts of the article. Michael Dufner is now affiliated with Medical School Berlin, Germany.
Data Availability: The data used in the simulation can be reproduced by running the simulation R script that is available at https://github.com/syuanuvt/CSSCA under the section Simulation. The application data (i.e., personality data) are available on request from Jaap J. A. Denissen (jjadenissen@gmail.com).
Declaration of Conflicting Interests: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by a personal grant from The Netherlands Organization for Scientific Research [NWO-Research Talent 406.17.526] awarded to Shuai Yuan.
Software Information: The simulation and the empirical analysis were conducted using the R software for statistical computing. The scripts of the analysis are available at https://github.com/syuanuvt/CSSCA. There, users can also freely download the R package ClusterSSCA, which implements the CSSCA algorithm.
Supplemental Material: The online supplement to the article is available on PsychArchives at the following address: http://dx.doi.org/10.23668/psycharchives.2601
Publisher Copyright: © The Author(s) 2019.
(Peer reviewed)