Minimum sample size for developing a multivariable prediction model: Part I – Continuous outcomes
Riley, Richard D.; Snell, Kym I.E.; Ensor, Joie; Burke, Danielle L.; Harrell, Frank E.; Moons, Karel G.M.; Collins, Gary S.
(2019) Statistics in Medicine, volume 38, issue 7, pp. 1262 - 1275
(Article)
Abstract
In the medical literature, hundreds of prediction models are being developed to predict health outcomes in individuals. For continuous outcomes, typically a linear regression model is developed to predict an individual's outcome value conditional on values of multiple predictors (covariates). To improve model development and reduce the potential for overfitting, a suitable sample size is required in terms of the number of subjects (n) relative to the number of predictor parameters (p) for potential inclusion. We propose that the minimum value of n should meet the following four key criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥ 0.9; (ii) small absolute difference of ≤ 0.05 in the apparent and adjusted R²; (iii) precise estimation (a margin of error ≤ 10% of the true value) of the model's residual standard deviation; and similarly, (iv) precise estimation of the mean predicted outcome value (model intercept). The criteria require prespecification of the user's chosen p and the model's anticipated R² as informed by previous studies. The value of n that meets all four criteria provides the minimum sample size required for model development. In an applied example, a new model to predict lung function in African-American women using 25 predictor parameters requires at least 918 subjects to meet all criteria, corresponding to at least 36.7 subjects per predictor parameter. Even larger sample sizes may be needed to additionally ensure precise estimates of key predictor effects, especially when important categorical predictors have low prevalence in certain categories.
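As a rough illustration of how criteria (i) and (ii) translate into a number, the Python sketch below searches for the smallest n that satisfies each. It is not the authors' implementation, and the abstract does not state the formulas. Criterion (ii) is derived here from the standard adjusted R² identity, which gives R²_app − R²_adj = p(1 − R²_adj)/(n − 1). Criterion (i) assumes a Van Houwelingen-style shrinkage expression, S = 1 + (p − 2)/(n ln(1 − R²_app)). The anticipated adjusted R² of 0.2 is likewise an assumed input, not a value given in this record. Criteria (iii) and (iv) are not covered, and all function names are hypothetical.

```python
import math


def n_for_r2_difference(p: int, r2_adj: float, delta: float = 0.05) -> int:
    """Criterion (ii): smallest n with R2_app - R2_adj <= delta.

    From the adjusted R^2 identity, R2_app - R2_adj = p * (1 - R2_adj) / (n - 1),
    so n >= 1 + p * (1 - R2_adj) / delta.
    """
    # The tiny tolerance guards against floating-point noise pushing an
    # exact integer result up to the next integer.
    return math.ceil(1 + p * (1 - r2_adj) / delta - 1e-9)


def shrinkage(n: int, p: int, r2_adj: float) -> float:
    """ASSUMED global shrinkage expression (not stated in the abstract)."""
    # Apparent R^2 implied by the anticipated adjusted R^2 at this n and p.
    r2_app = (r2_adj * (n - p - 1) + p) / (n - 1)
    return 1 + (p - 2) / (n * math.log(1 - r2_app))


def n_for_shrinkage(p: int, r2_adj: float, target: float = 0.9,
                    n_max: int = 1_000_000) -> int:
    """Criterion (i): first n (searching upward from p + 2) with shrinkage >= target."""
    for n in range(p + 2, n_max + 1):
        if shrinkage(n, p, r2_adj) >= target:
            return n
    raise ValueError("no n up to n_max meets the shrinkage target")


# Illustrative inputs: p = 25 as in the abstract; r2_adj = 0.2 is an assumption.
p, r2_adj = 25, 0.2
n_i = n_for_shrinkage(p, r2_adj)       # 918
n_ii = n_for_r2_difference(p, r2_adj)  # 401
n_min = max(n_i, n_ii)
print(n_i, n_ii, n_min, round(n_min / p, 2))  # 918 401 918 36.72
```

Under these assumed inputs the shrinkage criterion is the binding one, and the result agrees with the 918 subjects (about 36.7 per predictor parameter) reported in the abstract. A complete calculation would also take the maximum over criteria (iii) and (iv).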
Keywords: continuous outcome; linear regression; minimum sample size; multivariable prediction model; R-squared; Epidemiology; Statistics and Probability; Research Support, Non-U.S. Gov't; Journal Article; Research Support, N.I.H., Extramural
ISSN: 0277-6715
Publisher: John Wiley and Sons Ltd
Note: Funding Information: National Institute for Health Research School for Primary Care Research (NIHR SPCR); Netherlands Organisation for Scientific Research, Grant/Award Number: project 9120.8004 and 918.10.615; CTSA, Grant/Award Number: UL1 TR002243; National Center for Advancing Translational Sciences; US National Institutes of Health; NIHR Biomedical Research Centre. Danielle Burke and Kym Snell are funded by the National Institute for Health Research School for Primary Care Research (NIHR SPCR). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. Karel G.M. Moons receives funding from the Netherlands Organisation for Scientific Research (project 9120.8004 and 918.10.615). Frank Harrell's work on this paper was supported by CTSA award No. UL1 TR002243 from the National Center for Advancing Translational Sciences. Its contents are solely the responsibility of the authors and do not necessarily represent official views of the National Center for Advancing Translational Sciences or the US National Institutes of Health. Gary Collins is supported by the NIHR Biomedical Research Centre, Oxford.
We wish to thank three reviewers and an Associate Editor for their constructive comments, which helped improve the article upon revision. Publisher Copyright: © 2018 John Wiley & Sons, Ltd.
(Peer reviewed)