Associations between the urban exposome and type 2 diabetes: Results from penalised regression by least absolute shrinkage and selection operator and random forest models

Background: Type 2 diabetes (T2D) is thought to be influenced by environmental stressors such as air pollution and noise. Although environmental factors are interrelated, studies considering the exposome are lacking. We simultaneously assessed a variety of exposures in their association with prevalent T2D by applying penalised regression Least Absolute Shrinkage and Selection Operator (LASSO), Random Forest (RF)


Introduction
Type 2 diabetes (T2D) is a chronic disease with high individual and societal burden.Despite the genetic predisposition, environmental factors and lifestyle behaviours are important behavioural determinants in the etiology of T2D (Zheng et al., 2018).Environmental factors can affect the risk of T2D either directly (air pollutants, residential noise) or indirectly, by influencing lifestyle behaviours such as dietary habits and physical activity (walkability, green space) (Beulens et al., 2022).For instance, high neighbourhood walkability with more green areas and low levels of air pollution is associated with more physical activity (An et al., 2018;Barnett et al., 2017).With regard to direct environmental drivers, there is some evidence suggesting a potential link between T2D and exposure to arsenic in drinking water, persistent organic pollutants, pesticides, antibiotics and several other drugs, as well as atmospheric pollutants, such as nitrogen dioxide and fine particulate matter (Misra and Misra, 2020).
A recent systematic review by Beulens et al. summarized the existing evidence on the environmental risk factors of T2D (Beulens et al., 2022).There was robust evidence for the associations of air pollution, residential noise, neighbourhood walkability, green space, area-level socioeconomic deprivation and T2D and inconclusive evidence for association with outdoor temperature, neighbourhood social environment and food environment (Beulens et al., 2022;Pitt et al., 2017).
In real life environmental exposures and lifestyle behaviours are inseparable parts of the same complex structure that we call exposome.However, current evidence is mostly based on studies investigating single exposures (Wild, 2012).This might be problematic especially because such studies do not disentangle risks from associated environmental stressors.For instance, most studies focusing on air pollution did not consider green space or road traffic noise (Yang et al., 2020;Zare Sakhvidi et al., 2018).In the context of the exposome, only one study examined the association of 266 environmental factors measured in blood and urine samples with T2D.This was an environment wide association study (EXWAS) within the NHANES dataset based on biological measures of environmental, lifestyle and dietary exposures.Significant positive associations were found for the pesticide-derivative heptachlor epoxide, the vitamin c-tocopherol, and polychlorinated biphenyls and b-carotenes (Patel et al., 2010).Although the authors give a collective interpretation of results, using single or multivariable linear regression models may be ill-suited for exposome studies, because of a few reasons.First, these approaches do not consider complex interdependencies that exist between the exposures.Second, potential nonlinear exposure-outcome associations are mostly ignored.Third, when analysing a combination of highly correlated factors in a linear regression model simultaneously, generated effect estimates become unstable.Hence, more advanced statistical methods are required to analyse this type of high dimensional data.
In our previous work we showed how different methods could be applied to build interpretable and robust multi-exposure models (Ohanyan et al., 2022).Although there is not a gold standard approach for this type of analysis, the results showed that a combination of methods that complement each other by dealing with linear or nonlinear associations could be useful in capturing the overall picture of associations.For this reason, we used a linear model Least Absolute Shrinkage and Selection Operator (LASSO), Random Forest (RF) and Artificial Neural Networks (ANN) approaches.The last two methods can process nonlinear and non-additive associations without making any assumptions about the nature of the variables (Krogh, 2008;Stafoggia et al., 2017).LASSO is a more conventionally applied method for the purpose of variable selection among highly correlated variables (Petrovic et al., 2022;Stafoggia et al., 2017;Tibshirani, 1996).We also applied univariate analyses as to study the associations with established factors that in a multivariable model may not be selected due to low contribution to the overall fit.
The aim of this study was to examine the associations of a combination of 85 urban exposome factors and the prevalence of T2D, considering the nonlinear and non-additive associations and assess how our findings compare with the prior knowledge on established risk factors of T2D.

Study design and participants
We conducted a cross-sectional analysis using baseline data of the Occupational and Environmental Health Cohort (AMIGO) study.Participants across the Netherlands were randomly selected from the Dutch National General Practitioners Network database ("NIVEL Primary Care Registry," 2021).The only inclusion criterium was the age between 31 and 65 years old, as the target population were adults of working age from general population.Maximum one person per household were invited to complete the online questionnaire.Overall, 14,829 (16 % out of 93,550 invited) participants were included.A detailed description of the recruitment process, a flowchart of participant data and the ethical approval is provided elsewhere (Slottje et al., 2015).

Outcome variable
The outcome measure was prevalent T2D, assessed by self-reported questionnaires.Each participant responded to two questionnaire items: "Have you ever been diagnosed by a doctor with T2D ("non-insulin-dependent" or late-onset diabetes)?"and "Have you ever been diagnosed by a doctor with unknown type of diabetes?"(Slottje et al., 2015).Considering that T2D accounts for approximately 90 % of diabetes cases, we considered unknown type of diabetes as T2D.Among the participants who reported to have been diagnosed with T2D or unknown type of diabetes, a low number 43 (6 %) had reported age at the diagnostic less than 40 years, which might be type 1 diabetes.Therefore, a sensitivity analysis was performed where participants who reported unknown type of diabetes with an age at diagnoses less than 40 years, were not considered T2D cases.

Covariates
Self-reported questionnaire data on the duration of living at the current address, age, sex (male/female), country of birth (Netherlands/ other), country of birth of mother and of father (Netherlands/other), civil state (with/without a partner), current education (high (college, university degree) / low or medium (vocational education, community college, high school)), employment status (employed/unemployed), smoking (yes/no) were considered as covariates in this study.

Exposome factors
Geospatial models, monitoring stations, satellite data, and land use databases were used to assess a large set of environmental factors.These data were then linked to each respondent's geocoded residential address, to assess exposure at the home addresses of participants as a proxy for actual exposure (Martens et al., 2018).Exposure estimates were calculated for the questionnaire data collection period (2011)(2012).Overall, 85 exposures across a total of 12 exposure constructs were analysed: air pollution (19 factors), road traffic noise (1 factor), mobile phone base station radiofrequency electromagnetic field (1 factor), green space density (2 factors), outdoor light at night (1 factor), meteorology (2 factors), quality of the drinking water (29 factors), socio-demographic characteristics of the neighbourhood (16 factors), food environment (3 factors), built environment (10 factors) and road safety (1 factor).The assessment of these constructs and variables are detailed in Table 1 and supplementary material Table S1.

Statistical analysis 2.5.1. Data pre-processing
We excluded exposures that were judged uninformative based on the following reasons: i) variables with very low variability, e.g., when most observations (>99 %) had the same value, assessed by histograms and descriptive statistics (see the list in supplementary material, Table S2) or ii) if two variables were correlated at a level of r spearman ≥ 0.95.In the latter case only one out of the correlated variables was included in the analysis and was considered as a proxy for the other variable(s) (Table S3).Overall, 85 exposure variables were included in the analytical models.
Before building the models, all continuous exposures were standardized to the same scale by their standard deviations (Z-score).This step helps to maximize the comparability of variable importance scores  and minimize the impact of the measurement unit on the coefficients, independently from variable's original measurement units.

Missing values
The highest percentage of missing values was 11 % for the neighbourhood non-western immigrants.For the remainder of variables, the proportion of missing data was < 7 %, and the outcome measure had 2.8 % missing values.Five imputed datasets were generated using Multivariate Imputation via Chained Equations (MICE).Given the absence of a widely accepted way to combine the results from multiple imputation sets, as well as high computational cost of used statistical approaches, the imputed values were averaged across imputed datasets.Before introducing variables into the imputation model, some of them were transformed by logarithmic, root square, or inverse functions, to best Based on address density: 1=very highly urban ≥ 2,500 addresses per km 2 ; 2 = highly urban 1 500-2 500 addresses per km 2 ; 3 = moderately urban 1,000-1,500 addresses per km 2 ; 4 = less urban 500-1,000 addresses per km 2 ; 5 = non-urban < 500 addresses per km 2 approach a Gaussian distribution, as the imputation model assumes normal distribution for predictors (Osborne and Ph, 2005)(Table S4).Note that the nature of association was inversed for the multiplicative inverse transformed variables (multiplicative inverse = 1/variable) throughout the rest of statistical analysis (Table S4).The outcome measure was imputed using all variables.

Univariate and multivariate analysis
Most previously published studies had a single-exposure approach.To see if established risk factors were also existent in our sample, we analysed exposome factors in univariate logistic regression models adjusted for individual-level confounders (EXWAS).We interpreted the result in light of prior evidence as summarised by Beulens et al. (2022).We followed up with an agnostic analysis by including all variables simultaneously in multivariate models using penalised regression LASSO, RF and ANN.All multivariable models were based on the same pre-processed dataset.

Nested cross-validation
Current state-of-the-art suggests to use nested cross-validation for the combined tuning of hyperparameters and model selection (Krstajic et al., 2014).Nested cross-validation implies that hyperparameters are selected using the inner folds of cross-validation and, an unbiased estimate of the expected accuracy of the algorithm is computed across the outer folds of cross-validation (Wainer and Cawley, 2021).Thus, we divided the dataset into training and test sets (80 % and 20 % accordingly).The training data was in turn divided into five inner folds, each including 20 % of the data.During the cross-validation, the model was iteratively trained on four inner folds.The fifth fold was used as a validation set for hyperparameter tuning.In the outer loop of crossvalidation each of the outer folds was iteratively held out as a test set for the evaluation of model performance.
To maximize the comparability between statistical methods the same cross-validation folds were used for all models and stratified sampling on T2D case status (prevalence < 5 %) was used to create training and test sets.We compared predictive performances of multivariate models, using the logLoss metric (logistic loss or cross-entropy loss), which is based on probabilities and was suggested to be a better metric for model evaluation in imbalanced classification tasks (Harris and Samorani, 2021).A model that predicts perfectly would achieve a logLoss of zero, therefore the lower the logLoss, the better the prediction.

Penalised regression LASSO
LASSO is a penalised regression method that is commonly used in high dimensional data setting for variable selection (Tibshirani, 1996).LASSO forces the sum of the absolute value of the regression coefficients to be less than the tuning parameter lambda (λ) (Tibshirani, 1996).This causes the shrinkage of some coefficients to be zero, hence conducting a variable selection.The optimal value of lambda was selected using 5fold cross-validation.We used subsampling based stability selection to provide finite sample control of the family-wise error rate (Meinshausen and Bühlmann, 2010).Packages "glmnet" and "stabsel" in R were used to fit the LASSO model and for stability selection respectively.
The advantage for using LASSO is that it has good properties for variable selection among highly correlated variables, hence a good interpretability.It is easy to tune and requires a low computational time.However, it is a linear model, therefore it cannot disentangle complex nonlinear or non-additive associations.

Random forest
RF is an ensemble learning method where at each iteration a random subset of predictors and observations is selected to build a decision tree (Ishwaran and Lu, 2019).The predictions from these trees are then aggregated to form the forest.Permutation importance was used to assess the variable importance score and Shapley values were used to assess the directions of associations (Molnar, 2020).We used a scree plot to select variables with the highest variable importance.Packages "tuneRanger" and "ranger" were used to calibrate and to run the RF model.We calibrated the number of observations to sample for each decision tree ("sample.fraction"),the minimal size of terminal nodes to control for the depth of decision trees ("min.node.size"),and the number of variables to possibly split at each node ("mtry") using the package tuneRanger (Ohanyan et al., 2022).
RF can capture nonlinear and non-additive associations and recent developments in R software packages have drastically improved both the ease of the parameter tuning and the interpretability.

Artificial Neural Networks
The main structure of ANN consists of layers: one input layer, one or more hidden layers and one output layer.Each layer consists of neurons and weights attributed to neurons.The information passes along the network of layers until it reaches the output neurons.This is an artificial feed-forward neural network since the signals go towards one direction.The aim of a feed-forward ANN is pattern recognition; namely to find how input neurons (i.e., independent variables and covariates) predict output neurons (T2D: Yes/ No).The loss function then compares these predictions to the targets, producing a loss value: a measure of how well the network's predictions match what was expected (Chollet and Allaire, 2018).The optimizer uses this loss value to update the network's weights (Chollet and Allaire, 2018).
Parameters were calibrated through the nested cross-validation: number of hidden layers, epochs, learning rate of the Adam optimizer and penalization.The number of neurons on hidden layers (nodes) was set to 98 in each layer.The number of epochs is the number of iterations when the entire training data passes through the network, and after each epoch the weights are updated.A learning rate of 1e − 5 was used for the Adam optimizer.The results of cross-validation suggested an optimal value of 0.001 L1 penalty on weights starting from the second layer.The model used sigmoid activation function by design and cross-validated batch normalisation.The "keras" package in R was used for running ANN.
Similar to the RF, ANN can incorporate nonlinear associations and interactions.In recent years ANN gained popularity for its high predictive performance in various fields.The main disadvantages of the ANN are the high computational cost and poor interpretability.

Urban exposome and participants
Most participants (n = 14,829) were female(55.8%) and were on average 50.7 ± 9.4 years old.Over 70 % were employed and more than one third had a higher education(38.2%).Most respondents were originally from the Netherlands(95.3%) and were living with a partner (80.4 %).A total of 676(4.6 %) respondents had T2D (Table 2).
The correlation plot (Fig. 1) shows that intragroup correlations (based on untransformed data) were the strongest between air pollutants and socio-demographic characteristics of neighbourhoods.Variables representing the quality of drinking water had the lowest inter-and also intragroup correlations.Moderate level correlations existed between air pollutants and neighbourhood built environmental and sociodemographic factors.Green space was negatively correlated with air pollutants.Descriptive statistics of the factors of urban exposome can be found in Table S5.

Single exposure analysis
Our univariate logistic regression results confirmed most of the established risk factors, such as air pollutants (oxidative potential of PM2.5 (DTT), potassium in PM 10 ), neighbourhood SEP (neighbourhood average home values, high-income and low-income neighbourhoods), and urbanicity level.Despite the evidence from previous studies for an H. Ohanyan et al. association between outdoor noise, green space and T2D, our results did not confirm these associations.Among suspected risk factors, we found that surface temperature, heat island effect, and the share of non-Western immigrants in neighbourhood were associated with higher odds of having T2D.We also identified a few risk factors which were never studied before in relation to diabetes (sulphate in drinking water and neighbourhood proportion of divorced inhabitants).It should be noted that the p-values generally were not very low.Eight factors had pvalues lower than 0.01, among which the average home values (p < 0.001), high-income neighbourhoods (p < 0.001), low-income neighbourhoods (p < 0.01), temperature (p < 0.01), share of non-Western immigrants (p < 0.01) and heat island (p < 0.01), share of divorced inhabitants (p < 0.01), urbanicity level (p < 0.01).An overview of results from the univariate analysis is given in Table 3 and Table 4.

Multivariable analysis LASSO
Similar to the results of the univariate analyses, LASSO showed that living in economically deprived neighbourhoods (low neighbourhood home values, low share of high-income residents) was associated with a higher risk of T2D.Living in areas with higher proportion of non-Western immigrants and higher surface temperatures was also related to a higher risk of T2D.Residents of highly urban areas had a higher risk of T2D as compared to residents from less urban areas (Table 4).
Some other factors were identified by the LASSO, but were not selected after the stability selection procedure (Table 4, Table 5).From Table 4 it can be noted that the coefficients were generally low for all the factors.

Multivariable analysis random forest
Based on estimated variable importance from the RF model, four factors were selected: neighbourhood average home values, surface temperature, share of non-Western immigrants, and green space in 1 km buffer.Lower neighbourhood SEP and higher proportion of non-Western immigrants were related with a higher risk of T2D.In Shapley plots, the associations with temperature and green space appeared non-linear (Fig. 2).Similar to the coefficients from LASSO, relative effect sizes were generally low for all predictors, as indicated by the Shapley plots (Fig. 2).

Multivariable analysis ANN
The higher prediction error rate from the nested cross-validation of ANN indicated poor performance of this model.For comparison, the average logLoss from an empty model (random classification into two groups) was 0.186(0.0006)and for the ANN: 0.177(0.006).The performance of the ANN was thus only slightly better than the random classification.In addition, comparison of logLoss estimated during hyper-parameter tuning on the inner folds (0.173) to that obtained on T2D = Type 2 diabetes.
Fig. 1.Spearman correlations between constructs of exposures from the urban exposome.
H. Ohanyan et al. the outer folds (0.177) suggests possible overfitting of the model.For these reasons, we do not report results of the ANN in detail.

Comparison of the predictive performances and sensitivity analysis
Prediction error from the nested cross validation was lowest for the LASSO, when compared to RF and ANN.Average logLoss(sd) error across the outer folds was 0.168(0.003)for LASSO, 0.172(0.001)for RF and 0.177(0.006)for the ANN.Although the absolute value of the

Table 3
Results from the univariate analysis (EXWAS).All models were adjusted for individual confounding factors that were also used in all multivariate analysis.The exposures in all models were standardized(z-transformed).For random forest approach presented exposures are the top 10 important exposures.SEP = Socio-economic position; RF = Random Forest; DTT = dithiothreitol.* Indicates the factors for which the p-values were lower than 0.01 in singleexposure models, the factors that were selected after the stability selection in LASSO and the factors that had the highest variable importance scores in RF, as identified on scatterplot.
H. Ohanyan et al. average logLoss of RF was higher as compared to LASSO, the small standard error indicated on the possibility that the RF model was more stable.
The sensitivity analysis where eight participants with unknown type of diabetes who reported being diagnosed at <40 years old, were not considered as cases of T2D, showed very similar results for the univariate analysis, LASSO and RF (data not shown).

Discussion
We analysed a large set of environmental factors from the urban exposome as one complex system to identify the strongest predictors of T2D.Furthermore, we used univariate logistic regression to compare our findings with the knowledge from the literature, given that previously published studies mostly focused on single exposures without accounting for other related factors.Our univariate analyses based on known and suspected risk factors confirmed the associations of air pollution (oxidative potential of PM 2.5 (DTT), potassium in PM 10 ), urbanicity level, neighbourhood socio-economic position (SEP) (neighbourhood average home values, high-and low-income neighbourhoods), surface temperature and the share of non-Western immigrants in the neighbourhood with T2D risk, but not for road traffic noise and green space.The analyses in an agnostic framework (multivariable models of 85 exposures) identified associations for neighbourhood average home values, surface temperature, neighbourhood share of non-Western immigrants and green space in 1 km buffer.The factors with lower variable importance score (RF) or probability of selection (LASSO) were less consistent as they fluctuated between model runs.
Some known risk factors (air pollutants), which were identified in the univariate model, had more modest effects in more complex multivariable models.This may be because these risk factors have relatively small effects and do not add significantly to the predictive performance of the model.It could also be that they were falsely identified in previous research since other variables were not considered at the same time.The latter is possible, but as correlation patterns are likely to be different in  H. Ohanyan et al. each study, the chances are low that they would always result in the same bias.It should be noted that only a few risk factors were selected (stability selection-LASSO and scatterplot-RF) in both multivariable models, perhaps arguing for the small effect sizes.However, these results suggest that confounding by other risk-factors could be important.
The results of this study show that neighbourhood SEP and sociodemographic characteristics are associated with T2D.The negative association with neighbourhood SEP is largely supported by earlier studies, but little is known on the question of neighbourhood sociodemographic factors like the share of immigrants (Beulens et al., 2022).The share of non-Western immigrants is a multi-component factor.It contains elements of SEP, social and cultural interactions, eating behaviours and other genetic or biological factors, which could influence the risk of T2D.For example, South-East Asians have a higher genetic or biological risk of developing T2D for the same level of body mass index or waist circumference (Chan et al., 2014;Meeks et al., 2016;Yoon et al., 2006).Some relate this risk to the propensity to store fat viscerally rather than subcutaneously, the higher degree of insulin resistance, and the lower beta-cell function (Yoon et al., 2006).In our study the models were adjusted for the participant's and their parents' countries of origin in a crude way (Dutch vs non-Dutch), therefore we cannot exclude that this finding is reflecting the higher genetic/biological risk for T2D in certain ethnic groups.
We observed a nonlinear association with surface temperature, which was also seen in the univariate analysis and LASSO (Fig. 2).The association was positive in linear models, but took a parabolic shape with highlighted extreme values on Shapley plots for RF.This discrepancy brings uncertainty for the interpretation of this association.To gain more insight on the nature of this association, we compared Shapley plots with multivariable generalised additive model splines, which showed a rather positive association (Fig. S1).More longitudinal studies are required to confirm the potential association and measure the role of temperature heat extremes for diabetes.
The plotted association for the density of green space (1 km) looked similar to the plot of temperature (Fig. 2).Recent systematic reviews and meta-analysis showed that green space has been associated with 10-20 % lower risk of T2D (Beulens et al., 2022;Bilal et al., 2018;Dendup et al., 2018).The mechanism of association might be related to walkability and more physical activity, but green space is also related with reduced air pollution, noise and heat, compensating possible harmful effects.Furthermore, the density of green space is related with urbanicity level and other characteristics of the built environment.These interrelations with different aspects of the urban exposome could perhaps explain this nonlinear relationship (Fig. 2, Fig. S1).
The oxidative potential of PM 2.5 (DTT), potassium in PM 10 , the absorbance of PM 2.5 and the iron in PM 2.5 were identified by univariate and multivariable models, but none of these pollutants were selected.Although our univariate analysis did not find associations with PM 2.5 , PM 10 and NO 2 , a large meta-analysis showed a positive association with these risk factors (Yang et al., 2020).Current evidence suggests that air pollution might change endothelial function, trigger inflammation and insulin resistance (Beulens et al., 2022;Yang et al., 2020).An explanation for the differences observed between our univariate and multivariate models could be that there is a lack of studies exploring combinations of pollutants and other related factors.Another possible explanation could be that the effect sizes of these pollutants and also the contrast in our data sample were too small, therefore do not contribute to the predictive performance of the models.
Nested cross validation allowed the comparison of nearly unbiased estimates for model performance, because the final model performance evaluation was done on the holdout test set, which was not used during the training process.Surprisingly, our analysis showed that the predictive accuracy of the ANN model for our data was lower as compared to other multivariable methods.It should be noted that we only trained a regular feed-forward ANN.It is possible that ANNs with different architectures could have had a better performance.
Despite the increasing popularity of ANNs in a wide range of fields, in this setting, with many weak predictors and very imbalanced data (due to low prevalence of diabetes (<5%)), more simple methods like penalized regression may perform better.ANNs have been developed to learn from the data and are very powerful in strong predictive tasks, such as image or text processing.Considering ANN's computational burden, the need for a lot of data and difficulties in the training, in an exposome context it is perhaps better to use alternative methods with similar properties for dealing with multiple interrelated factors with complex nonlinear or non-additive associations.
Our study was based on data from a large, nationwide cohort enriched with a wide variety of urban exposome risk factors.All exposome factors were analysed simultaneously, using RF next to penalized linear model LASSO, to identify potential nonlinear and non-additive associations.Our study has several limitations.First, the lack of availability of the timing of T2D diagnosis and the cross-sectional design is limiting causal interpretation of the findings, as the temporal link cannot be established between the exposures and the outcome.Second, the AMIGO cohort study has a potential limitation by selection bias, given the low participation rate (<16 %).Slottje et al. compared the baseline and health-related characteristics of study participants with the source population.They found no consistent indications of systematic healthrelated participation bias, but men below 50 years of age and those with an intermediate level of education were under-represented among cohort members, while those born in the Netherlands were over-represented, probably in part due to the fact that the questionnaire was in Dutch.However, the authors concluded that given the achieved contrast between sociodemographic, environmental factors and the results of the health-related bias analysis, limited differences with the source population are not a major concern for the internal validity of the study and if generalization to general adult population is desired, these results can be used for weighting purposes (Slottje et al., 2015).Third, the outcome measure as well as the confounding factors were assessed through selfreported questionnaire data, which is prone to errors and to the bias of desirable reporting.However, studies show that self-reported T2D is a valid measure in large-scale epidemiological studies (Li et al., 2020;Pastorino et al., 2015;Sluijs et al., 2010).In addition, we performed a sensitivity analysis excluding participants with unknown type of diabetes and age at diagnostic less than 40 years (potentially misclassified as T2D), which generated similar results.Fourth, the environmental factors were a mixture of modelled and measured factors, likely containing both types of measurement errors: classical and Berkson's error (Agier et al., 2020).In general terms, this means that the sensitivity of models is lower for highly variable factors (if we repeat the exposure assessment several times, those with the lowest intra-class coefficient of correlation) compared to factors that are more stable over time (Agier et al., 2020;Ohanyan et al., 2022).
This study is one of the first to investigate the relations of various stressors from the urban exposome and the risk of T2D.Neighbourhood socio-economic and socio-demographic characteristics, surface temperature, urbanicity, and green space were related with the prevalence of T2D.Although effect sizes were small, on the population level the impact of these factors could be substantial.Therefore, targeted policy approaches that address socio-economic disparities on neighbourhoodlevel and measures for a better urban planning with more green areas could help to improve public health.

Fig. 2 .
Fig. 2. Shapley plots of the top predictors of RF model.It shows the average effect of each predictor on the predicted outcome.

Table 1
Description of data sources for each exposure.More detailed explanations can be found in the supplementary material.
Inhabitants with non-western origins (%) Average value of houses (x 1000 euros) Inhabitants with income below 40th percentile (%) Inhabitants with income above 20th percentile (%) (continued on next page) H.Ohanyan et al.

Table 1
(continued ) Average distance to the general practitioner's office, pharmacies and hospitals were grouped as "access to medical facilities".Distance to educational facilities (km)Average distance to kindergartens, elementary, middle, and high school facilities were grouped as "access to educational facilities".Distance to recreational facilities (km)Museums, cinemas, attraction parks, concert halls, swimming pools, ice skating halls, saunas and tanning clinics were grouped as "access to recreational activities" g., greengrocers, bakeries, butchers etc.) were categorised as "healthy food" exposure, and restaurants, fast food restaurants and takeaway places, cafés, pancake houses, bars and pubs were classified as "non-healthy food" exposure.Distance to healthier food retailers (km) Healthy food retailers (1km and 5km buffer) Non healthy food retailers (1km buffer)Quality of drinking waterCountrywide maps of drinking water quality are annually generated by the Dutch National Institute for Public Health and the Environment.Data from 2012 was used to assess the annual average values of 29 bacterial and chemical compounds, that were measured in the closest tested pump.H.Ohanyan et al.

Table 2
Characteristics of the participants from baseline data of Occupational and Environmental Health cohort (AMIGO).

Table 4
Beulens et al. (2022)of the findings across statistical methods and previous evidence from literature as reported in the systematic review byBeulens et al. (2022).

Table 5
Results from the penalised regression: LASSO.Selected factors after the stability selection procedure. *