Abstract
Background: In many binary classification problems, one class is represented by a large number of examples (the majority class) while the other (the minority class) is represented by only a few, a phenomenon known as class imbalance. Many algorithms are biased towards predicting the majority class correctly and fail to correctly classify the minority group; this is known as the class imbalance problem. Previous research has shown that one-dimensional complexity aggravates the class imbalance problem, and that multi-dimensionality exacerbates it as well. This study explored how complexity influences class prediction of multi-dimensional imbalanced data with eight commonly used classification algorithms. Additionally, two predictive performance metrics, accuracy and the Matthews correlation coefficient (MCC), were compared to reveal the influence of the choice of performance metric in imbalanced data.
Results: Different scenarios were compared by means of data simulation under different levels of complexity, numbers of dimensions and class balances. First, it was observed that increasing dimensionality or complexity resulted in a decrease of predictive performance, especially in severely imbalanced datasets. Secondly, it was observed that when complexity increases, the effect of dimensionality is enhanced. Some classifiers proved to be more suitable for complex and multi-dimensional class-imbalanced datasets, with Support Vector Machines outperforming other classifiers in the most complex, multi-dimensional and more imbalanced scenarios.
However, SVM was more sensitive to the class imbalance problem than Diagonal Linear Discriminant Analysis, which performed better in the minority group for the most extremely imbalanced samples (99:1 class balance). Accuracy was much less representative of performance in the minority group than MCC.
Conclusions: Overall, researchers using class-imbalanced data should consider the amount of imbalance in their data as well as its dimensionality and complexity, ultimately leading to more appropriate selection of classification algorithms. Furthermore, researchers using imbalanced data should contemplate the importance of correct predictions for the minority class vs. the majority class. This study underlines the importance of using a measure that is robust against class imbalance, and favors the use of the MCC as opposed to accuracy.
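To make the contrast between accuracy and MCC concrete, the following minimal Python sketch computes both metrics from confusion-matrix counts for a 99:1 class balance. The sample size, the two sets of predictions, and the function names are hypothetical and chosen only for illustration; they are not taken from the study's simulations. Returning 0 for the MCC when its denominator is zero is a common convention, assumed here.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating label 1 as the minority (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all examples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 by convention if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical 99:1 data: 990 majority (0) and 10 minority (1) examples.
y_true = [0] * 990 + [1] * 10

# Classifier A (hypothetical): always predicts the majority class.
pred_a = [0] * 1000

# Classifier B (hypothetical): 20 false alarms, but finds 5 of the 10 minority cases.
pred_b = [0] * 970 + [1] * 20 + [1] * 5 + [0] * 5

acc_a, mcc_a = accuracy(*confusion_counts(y_true, pred_a)), mcc(*confusion_counts(y_true, pred_a))
acc_b, mcc_b = accuracy(*confusion_counts(y_true, pred_b)), mcc(*confusion_counts(y_true, pred_b))

print(f"A: accuracy={acc_a:.3f}, MCC={mcc_a:.3f}")
print(f"B: accuracy={acc_b:.3f}, MCC={mcc_b:.3f}")
```

In this toy setting, the classifier that never identifies a minority example has the higher accuracy, while only the MCC reflects that the second classifier recovers part of the minority class.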