Class imbalance in binary classification tasks remains a significant challenge in machine learning, often resulting in poor performance on minority classes. This study comprehensively evaluates three widely used strategies for handling class imbalance: the Synthetic Minority Over-sampling Technique (SMOTE), Class Weights tuning, and Decision Threshold Calibration. We compare these methods against a baseline scenario across 15 diverse machine learning models and 30 datasets from various domains, conducting a total of 9,000 experiments. Performance was primarily assessed using the F1-score, along with nine additional metrics including F2-score, precision, recall, Brier score, PR-AUC, and AUC. Our results indicate that all three strategies generally outperform the baseline, with Decision Threshold Calibration emerging as the most consistently effective technique. However, we observed substantial variability in the best-performing method across datasets, highlighting the importance of testing multiple approaches for specific problems. This study provides valuable insights for practitioners dealing with imbalanced datasets and emphasizes the need for dataset-specific analysis in evaluating class imbalance handling techniques.
Binary classification tasks frequently encounter imbalanced datasets, where one class significantly outnumbers the other. This imbalance can severely impact model performance, often resulting in classifiers that excel at identifying the majority class but perform poorly on the critical minority class. In fields such as fraud detection, disease diagnosis, and rare event prediction, this bias can have serious consequences.
One of the most influential techniques developed to address this challenge is the Synthetic Minority Over-sampling Technique (SMOTE), a method proposed by Chawla et al. (2002) that generates synthetic examples of the minority class. Since its introduction, the SMOTE paper has become one of the most cited papers in the field of imbalanced learning, with over 30,000 citations. SMOTE's popularity has spurred the creation of many other oversampling techniques and numerous SMOTE variants. For example, Kovács (2019) documents 85 SMOTE-variants implemented in Python, including:
Despite its widespread use, recent studies have raised some criticisms of SMOTE. For instance, Blagus and Lusa (2013) indicate limitations in handling high-dimensional data, while Elor and Averbuch-Elor (2022) and Hulse et al. (2007) suggest the presence of better alternatives for handling class imbalance. This highlights that while SMOTE is a powerful tool, it is not without limitations.
In this study, we aim to provide a more balanced view of techniques for handling class imbalance by evaluating not only SMOTE but also other widely-used strategies, such as Class Weights and Decision Threshold Calibration. These three treatment scenarios target class imbalance at different stages of the machine learning pipeline:
We compare these strategies with the Baseline approach (standard model training without addressing imbalance) to assess their effectiveness in improving model performance on imbalanced datasets. Our goal is to provide insights into which treatment methods offer the most significant improvements in performance metrics such as F1-score, F2-score, accuracy, precision, recall, Matthews Correlation Coefficient (MCC), Log-Loss, Brier score, PR-AUC, and AUC. We also aim to evaluate these techniques across a wide range of datasets and models to provide a more generalizable understanding of their effectiveness.
To ensure a comprehensive evaluation, this study encompasses:
In total, we conduct 9,000 experiments, spanning the four scenarios, 15 models, 30 datasets, and five cross-validation folds (4 × 15 × 30 × 5 = 9,000). This extensive approach allows us to compare these methods and their impact on model performance across a wide range of scenarios and algorithmic approaches. It provides a robust foundation for understanding the effectiveness of different imbalance handling strategies in binary classification tasks.
We selected 30 datasets based on the following criteria:
The characteristics of the selected datasets are summarized in the chart below:
The dataset selection criteria were carefully chosen to ensure a comprehensive and practical study:
You can find the study datasets and information about their sources and specific characteristics in the following repository:
Imbalanced Classification Study Datasets
This repository is also linked in the Datasets section of this publication.
Our study employed a diverse set of 15 classifier models, encompassing a wide spectrum of algorithmic approaches and complexities. This selection ranges from simple baselines to advanced ensemble methods and neural networks, including tree-based models and various boosting algorithms. The diversity in our model selection allows us to assess how different imbalanced data handling techniques perform across various model types and complexities.
The following chart lists the models used in our experiments:
A key consideration in our model selection process was ensuring that all four scenarios (Baseline, SMOTE, Class Weights, and Decision Threshold Calibration) could be applied consistently to each model. This criterion influenced our choices, leading to the exclusion of certain algorithms such as k-Nearest Neighbors (KNN) and Naive Bayes Classifiers, which do not inherently support the application of class weights. This careful selection process allowed us to maintain consistency across all scenarios while still representing a broad spectrum of machine learning approaches.
To ensure a fair comparison, we used the same preprocessing pipeline for all 15 models and scenarios. This pipeline includes steps such as one-hot encoding, standard scaling, and missing data imputation. The only difference in preprocessing occurs in the SMOTE scenario, where synthetic minority class examples are generated. Otherwise, the preprocessing steps are identical across all models and scenarios, ensuring that the only factors that vary between experiments are the learning algorithm and the imbalance handling technique applied.
Additionally, each model's hyperparameters were kept constant across the Baseline, SMOTE, Class Weights, and Decision Threshold scenarios to ensure fair comparisons.
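The exact preprocessing code lives in the linked model repositories; the snippet below is only a minimal sketch, assuming scikit-learn and imbalanced-learn, of how a shared pipeline with an optional SMOTE step might be assembled. The column lists, the classifier, and the class-weight value are illustrative placeholders, not the study's actual settings.

```python
# Minimal sketch (not the study's exact code): a shared preprocessing pipeline,
# with SMOTE inserted only in the SMOTE scenario. Column names, the classifier,
# and the class-weight value are illustrative placeholders.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["num_feat_1", "num_feat_2"]   # placeholder column names
categorical_cols = ["cat_feat_1"]             # placeholder column names

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

def build_pipeline(scenario: str):
    """Identical preprocessing everywhere; only the SMOTE scenario adds resampling."""
    clf = RandomForestClassifier(random_state=42)  # hyperparameters held fixed across scenarios
    if scenario == "smote":
        return ImbPipeline([("prep", preprocessor),
                            ("smote", SMOTE(random_state=42)),
                            ("model", clf)])
    if scenario == "class_weights":
        clf.set_params(class_weight={0: 1.0, 1: 5.0})  # minority weight tuned; 5.0 is illustrative
    return Pipeline([("prep", preprocessor), ("model", clf)])
```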
The imbalanced data handling scenarios are implemented in a branch named `imbalance`. A configuration file, `model_config.json`, allows users to specify which scenario to run: `baseline`, `smote`, `class_weights`, or `decision_threshold`.
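The exact schema of `model_config.json` is not reproduced here; as a rough illustration, the scenario switch could be read along these lines (the `scenario` key name is our assumption):

```python
# Illustrative sketch only: the actual key names in model_config.json may differ.
import json

with open("model_config.json") as f:
    config = json.load(f)

# "scenario" is an assumed key; valid values per the description above.
scenario = config.get("scenario", "baseline")
assert scenario in {"baseline", "smote", "class_weights", "decision_threshold"}
```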
All model implementations are available in our public repositories, linked in the Models section of this publication.
To comprehensively evaluate the performance of the models across different imbalanced data handling techniques, we tracked the following 10 metrics:
Our primary focus is on the F1-score, a label metric that uses predicted classes rather than underlying probabilities. The F1-score provides a balanced measure of precision and recall, making it particularly useful for assessing performance on imbalanced datasets.
While real-world applications often employ domain-specific cost matrices to create custom metrics, defining such costs is not feasible across the 30 diverse datasets in our study. The F1-score allows us to evaluate all four scenarios, including decision threshold tuning, consistently across this varied set of problems.
Although our analysis emphasizes the F1-score, we report results for all 10 metrics. Readers can find comprehensive information on model performance across all metrics and scenarios in the detailed results repository linked in the Datasets section of this publication.
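For reference, all ten tracked metrics can be computed with standard scikit-learn functions; the snippet below is a sketch of such an evaluation helper, not the study's own evaluation code.

```python
# Sketch: computing the tracked metrics with scikit-learn (not the study's code).
# y_true are test labels, y_prob the predicted positive-class probabilities, and
# y_pred the predicted labels (probabilities thresholded at 0.5 or a tuned threshold).
from sklearn.metrics import (accuracy_score, average_precision_score, brier_score_loss,
                             f1_score, fbeta_score, log_loss, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_prob, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),                     # primary metric in this study
        "f2": fbeta_score(y_true, y_pred, beta=2),          # recall-weighted
        "mcc": matthews_corrcoef(y_true, y_pred),
        "log_loss": log_loss(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),  # PR-AUC
        "auc": roc_auc_score(y_true, y_prob),
    }
```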
Our experimental procedure was designed to ensure a robust and comprehensive evaluation of the four imbalance handling scenarios across diverse datasets and models. The process consisted of the following steps:
We employed a form of nested cross-validation for each dataset to ensure robust model evaluation and proper hyperparameter tuning:
Outer Loop: 5-fold cross-validation
Inner Validation: 90/10 train-validation split
This nested structure ensures that the test set from the outer loop remains completely unseen during both training and hyperparameter tuning, providing an unbiased estimate of model performance. The outer test set was reserved for final evaluation, while the inner validation set was used solely for hyperparameter optimization in the relevant scenarios.
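A schematic of this nested procedure is sketched below, reusing the `build_pipeline` and `evaluate` helpers from the earlier sketches and assuming `X` and `y` are NumPy arrays of features and binary (0/1) labels; it is illustrative rather than the study's actual code.

```python
# Sketch of the nested evaluation loop described above (illustrative only).
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def run_scenario(scenario, X, y):
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # outer loop
    fold_scores = []
    for train_idx, test_idx in outer_cv.split(X, y):
        X_train_full, y_train_full = X[train_idx], y[train_idx]
        X_test, y_test = X[test_idx], y[test_idx]

        # Inner 90/10 split, used only for scenarios with something to tune
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_full, y_train_full, test_size=0.10,
            stratify=y_train_full, random_state=42)

        pipeline = build_pipeline(scenario)
        pipeline.fit(X_train, y_train)
        # ... tune scenario-specific settings on (X_val, y_val) here, refit if needed ...

        y_prob = pipeline.predict_proba(X_test)[:, 1]
        y_pred = (y_prob >= 0.5).astype(int)  # 0.5 replaced by the tuned threshold when applicable
        fold_scores.append(evaluate(y_test, y_prob, y_pred))
    return fold_scores
```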
Baseline: This scenario involves standard model training without any specific treatment for class imbalance. It serves as a control for comparing the effectiveness of the other strategies.
SMOTE (Synthetic Minority Over-sampling Technique): In this scenario, we apply SMOTE to the training data to generate synthetic examples of the minority class.
Class Weights: This approach involves adjusting the importance of classes during model training, focusing on the minority class weight while keeping the majority class weight at 1.
Decision Threshold Calibration: In this scenario, we adjust the classification threshold post-training to optimize the model's performance on imbalanced data.
Each scenario implements only one treatment method in isolation. We do not combine treatments across scenarios. Specifically:
This approach allows us to assess the individual impact of each treatment method on handling class imbalance.
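As a compact illustration of this isolation principle (using the scenario names from the configuration file above; the flag names are our own shorthand, not taken from the study's code):

```python
# Sketch: each scenario enables exactly one treatment; treatments are never combined.
SCENARIO_SETTINGS = {
    "baseline":           {"use_smote": False, "tune_class_weight": False, "tune_threshold": False},
    "smote":              {"use_smote": True,  "tune_class_weight": False, "tune_threshold": False},
    "class_weights":      {"use_smote": False, "tune_class_weight": True,  "tune_threshold": False},
    "decision_threshold": {"use_smote": False, "tune_class_weight": False, "tune_threshold": True},
}
```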
For scenarios requiring hyperparameter tuning (SMOTE, Class Weights, and Decision Threshold), we employed a simple grid search strategy to maximize the F1-score measured on the single validation split (10% of the training data) for each fold.
The grid search details for the three treatment scenarios were as follows:
There are no scenario-specific hyperparameters to tune for the Baseline scenario. As a result, no train/validation split was needed, and the entire training set was used for model training.
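As an illustration of the general mechanism, a validation-split grid search for the Decision Threshold and Class Weights scenarios could look like the sketch below. The candidate grids and the `build_model` factory are our own placeholders, not the study's actual values; the SMOTE parameters are tuned analogously on the same split.

```python
# Sketch of the validation-split grid search (illustrative grids and helpers).
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(model, X_val, y_val, grid=np.arange(0.05, 0.96, 0.05)):
    """Decision Threshold scenario: pick the threshold maximizing F1 on the validation split."""
    probs = model.predict_proba(X_val)[:, 1]
    scores = [f1_score(y_val, (probs >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])

def tune_minority_weight(build_model, X_train, y_train, X_val, y_val,
                         grid=(1, 2, 5, 10, 20, 50)):
    """Class Weights scenario: pick the minority weight (majority fixed at 1) maximizing F1.
    `build_model` is a hypothetical factory returning an untrained classifier."""
    best_weight, best_f1 = 1, -1.0
    for w in grid:
        model = build_model(class_weight={0: 1, 1: w}).fit(X_train, y_train)
        score = f1_score(y_val, model.predict(X_val))
        if score > best_f1:
            best_weight, best_f1 = w, score
    return best_weight
```

In the Decision Threshold scenario, the returned threshold is then applied when converting test-set probabilities into predicted labels.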
For each experiment, we recorded the 10 performance metrics across the five test splits. In the following sections, we present the results of these extensive experiments.
This section presents a comprehensive analysis of our experiments comparing four strategies for handling class imbalance in binary classification tasks. We begin with an overall comparison of the four scenarios (Baseline, SMOTE, Class Weights, and Decision Threshold Calibration) across all ten evaluation metrics. Following this, we focus on the F1-score metric to examine performance across the 15 classifier models and 30 datasets used in our study.
Our analysis is structured as follows:
For the overall, model-specific, and dataset-specific analyses, we report mean performance and standard deviations across the five test splits from our cross-validation procedure. The final section presents the results of our statistical tests, offering a rigorous comparison of the four scenarios' effectiveness in handling class imbalance.
Figure 1 presents the mean performance and standard deviation for all 10 evaluation metrics across the four scenarios: Baseline, SMOTE, Class Weights, and Decision Threshold Calibration.
Figure 1: Mean performance and standard deviation of evaluation metrics across all scenarios. Best values per metric are highlighted in blue.
The results represent aggregated performance across all 15 models and 30 datasets, providing a comprehensive overview of the effectiveness of each scenario in handling class imbalance.
The results show that all three class imbalance handling techniques outperform the Baseline scenario in terms of F1-score:
This suggests that addressing class imbalance, regardless of the method, generally improves model performance as measured by the F1-score.
While our analysis primarily focuses on the F1-score, it's worth noting observations from the other metrics:
Based on these observations, Decision Threshold Calibration demonstrates strong performance across several key metrics, particularly those focused on minority class prediction (F1-score, F2-score, and Recall). It achieves this without compromising the probability calibration of the baseline model, as evidenced by their identical Brier Scores. In contrast, while SMOTE improves minority class detection, it leads to the least well-calibrated probabilities, as shown by its poor Brier Score. This suggests that Decision Threshold Calibration could be particularly effective in scenarios where accurate identification of the minority class is crucial, while still maintaining the probability calibration of the original model.
For the rest of this article, we will focus on the F1-score due to its balanced representation of precision and recall, which is particularly important in imbalanced classification tasks.
Figure 2 presents the mean F1-scores and standard deviations for each of the 15 models across the four scenarios. Each model's scores are averaged across the 30 datasets.
Figure 2: Mean F1-scores and standard deviations for each model across the four scenarios. Highest values per model are highlighted in blue.
Key observations from these results include:
Scenario Comparison: For each model, we compared the performance of the four scenarios (Baseline, SMOTE, Class Weights, and Decision Threshold Calibration). This within-model comparison is more relevant than comparing different models to each other, given the diverse nature of the classifier techniques.
Decision Threshold Performance: The Decision Threshold Calibration scenario achieved the highest mean F1-score in 10 out of 15 models. Notably, even when it wasn't the top performer, it consistently remained very close to the best scenario for that model.
Other Scenarios: Within individual models, Class Weights performed best in 3 cases, while SMOTE and Baseline each led in 1 case.
Consistent Improvement: All three imbalance handling techniques generally showed improvement over the Baseline scenario across most models, with 1 exception.
These results indicate Decision Threshold Calibration was most frequently the top performer across the 15 models. This suggests that adjusting the decision threshold after model training is a robust strategy for improving model performance across different classifier techniques. However, the strong performance of other techniques in some cases underscores the importance of testing multiple approaches when dealing with imbalanced datasets in practice.
Figure 3 presents the mean F1-scores and standard deviations for each of the 30 datasets across the four scenarios.
Figure 3: Mean F1-scores and standard deviations for each dataset across the four scenarios. Highest values per dataset are highlighted in blue.
These results are aggregated across the 15 models for each dataset. While this provides insights into overall trends, in practice, one would typically seek to identify the best model-scenario combination for a given dataset under consideration.
Key observations from these results include:
Variability: There is substantial variability in which scenario performs best across different datasets, highlighting that there is no one-size-fits-all solution for handling class imbalance.
Scenario Performance:
Improvement Magnitude: The degree of improvement over the Baseline varies greatly across datasets, from no improvement to substantial gains (e.g., satellite vs abalone_binarized).
Benefit of Imbalance Handling: While no single technique consistently outperformed others across all datasets, the three imbalance handling strategies generally showed improvement over the Baseline for most datasets.
These results underscore the importance of testing multiple imbalance handling techniques for each specific dataset and task, rather than relying on a single approach. The variability observed suggests that the effectiveness of each method may depend on the unique characteristics of each dataset.
One notable observation is the contrast between these dataset-level results and the earlier model-level results. While the model-level analysis suggested Decision Threshold Calibration as a generally robust approach, the dataset-level results show much more variability. This apparent discrepancy highlights the complexity of handling class imbalance and suggests that the effectiveness of different techniques may be more dependent on dataset characteristics than on model type.
To rigorously compare the performance of the four scenarios, we conducted statistical tests on the F1-scores aggregated by dataset (averaging across the 15 models for each dataset).
We performed a repeated measures ANOVA to test for significant differences among the four scenarios. For this test, we have 30 datasets, each with four scenario F1-scores, resulting in 120 data points. The null hypothesis is that there are no significant differences among the mean F1-scores of the four scenarios. We use a repeated measures ANOVA because we have multiple measurements (one per scenario) for each dataset.
Following the significant ANOVA result, we conducted post-hoc pairwise comparisons using a Bonferroni correction to adjust for multiple comparisons. With 6 comparisons, our adjusted alpha level is 0.05/6 = 0.0083.
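For readers who wish to reproduce this analysis, the sketch below shows one way to run it in Python, assuming the per-dataset mean F1-scores are collected in a long-format pandas DataFrame named `df` (this data layout is our assumption, not the study's actual script).

```python
# Sketch of the statistical analysis. Assumes a long-format DataFrame `df` with
# columns: dataset, scenario, f1 (one row per dataset/scenario pair, 120 rows).
from itertools import combinations

from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Repeated measures ANOVA: scenario is the within-subject factor, dataset the subject.
print(AnovaRM(df, depvar="f1", subject="dataset", within=["scenario"]).fit())

# Post-hoc paired t-tests with Bonferroni correction (6 pairwise comparisons).
scenarios = ["Baseline", "SMOTE", "Class Weights", "Decision Threshold"]
n_comparisons = 6
for a, b in combinations(scenarios, 2):
    x = df[df["scenario"] == a].sort_values("dataset")["f1"].to_numpy()
    y = df[df["scenario"] == b].sort_values("dataset")["f1"].to_numpy()
    _, p = stats.ttest_rel(x, y)
    p_adj = min(p * n_comparisons, 1.0)  # Bonferroni-corrected p-value
    print(f"{a} vs {b}: corrected p = {p_adj:.2e}")
```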
The p-values for the pairwise comparisons are presented in Table 1.
Table 1: P-values for pairwise comparisons (Bonferroni-corrected)
| Scenario | Class Weights | Decision Threshold | SMOTE |
|---|---|---|---|
| Baseline | 7.77e-05 | 2.26e-04 | 1.70e-03 |
| Class Weights | - | 2.06e-03 | 1.29e-01 |
| Decision Threshold | - | - | 2.83e-02 |
Key findings from the pairwise comparisons:
These results suggest that while all three imbalance handling techniques (SMOTE, Class Weights, and Decision Threshold) significantly improve upon the Baseline, the differences among these techniques are less pronounced. The Decision Threshold approach shows a significant improvement over Baseline and Class Weights, but not over SMOTE, indicating that both Decision Threshold and SMOTE may be equally effective strategies for handling class imbalance in many cases.
Our comprehensive study on handling class imbalance in binary classification tasks yielded several important insights:
Addressing Class Imbalance: Our results strongly suggest that handling class imbalance is crucial for improving model performance. Across most datasets and models, at least one of the imbalance handling techniques outperformed the baseline scenario, often by a significant margin.
Effectiveness of SMOTE: SMOTE demonstrated considerable effectiveness in minority class detection, showing significant improvements over the baseline in many cases. It was the best-performing method for 30% of the datasets, indicating its value as a class imbalance handling technique. However, it's important to note that while SMOTE improved minority class detection, it also showed the worst performance in terms of probability calibration, as evidenced by its high Log-Loss and Brier Score. This suggests that while SMOTE can be effective for improving classification performance, it may lead to less reliable probability estimates. Therefore, its use should be carefully considered in applications where well-calibrated probabilities are crucial.
Optimal Method: Decision Threshold Calibration emerged as the most consistently effective technique, performing best for 40% of datasets and showing robust performance across different model types. It's also worth noting that among the three methods studied, Decision Threshold Calibration is the least computationally expensive. Given its robust performance and efficiency, it could be considered a strong default choice for practitioners dealing with imbalanced datasets.
Variability Across Datasets: Despite the overall strong performance of Decision Threshold Calibration, we observed substantial variability in the best-performing method across datasets. This underscores the importance of testing multiple approaches for each specific problem.
Importance of Dataset-Level Analysis: Unlike many comparative studies on class imbalance that report results at the model level aggregated across datasets, our study emphasizes the importance of dataset-level analysis. We found that the best method can vary significantly depending on the dataset characteristics. This observation highlights the necessity of analyzing and reporting findings at the dataset level to provide a more nuanced and practical understanding of imbalance handling techniques.
While our study provides valuable insights, it's important to acknowledge its limitations:
Fixed Hyperparameters: We used previously determined model hyperparameters. Future work could explore the impact of optimizing these hyperparameters specifically for imbalanced datasets. For instance, adjusting the maximum depth in tree models might allow for better modeling of rare classes.
Statistical Analysis: Our analysis relied on repeated measures ANOVA and post-hoc tests. A more sophisticated approach, such as a mixed-effects model accounting for both dataset and model variability simultaneously, could provide additional insights and is an area for future research.
Dataset Characteristics: While we observed variability in performance across datasets, we didn't deeply analyze how specific dataset characteristics (e.g., sample size, number of features, degree of imbalance) might influence the effectiveness of different methods. Future work could focus on identifying patterns in dataset characteristics that predict which imbalance handling technique is likely to perform best.
Limited Scope of Techniques: Our study focused on three common techniques for handling imbalance. Future research could expand this to include other methods or combinations of methods.
Performance Metric Focus: While we reported multiple metrics, our analysis primarily focused on F1-score. Different applications might prioritize other metrics, and the relative performance of these techniques could vary depending on the chosen metric.
These limitations provide opportunities for future research to further refine our understanding of handling class imbalance in binary classification tasks. Despite these limitations, our study offers valuable guidance for practitioners and researchers dealing with imbalanced datasets, emphasizing the importance of addressing class imbalance and providing insights into the relative strengths of different approaches.
Our study provides a comprehensive evaluation of three widely used strategies—SMOTE, Class Weights, and Decision Threshold Calibration—for handling imbalanced datasets in binary classification tasks. Compared to a baseline scenario where no intervention was applied, all three methods demonstrated substantial improvements in key metrics related to minority class detection, particularly the F1-score, across a wide range of datasets and machine learning models.
The results show that addressing class imbalance is crucial for improving model performance. Decision Threshold Calibration emerged as the most consistent and effective technique, offering significant performance gains across various datasets and models. SMOTE also performed well, and Class Weights tuning proved to be a reasonable method for handling class imbalance, showing moderate improvements over the baseline.
However, the variability in performance across datasets highlights that no single method is universally superior. Therefore, practitioners should consider testing multiple approaches and tuning them based on their specific dataset characteristics.
While our study offers valuable insights, certain areas could be explored in future research. We fixed the hyperparameters across scenarios to ensure fair comparisons, holding all factors constant except for the treatment. Future research could investigate optimizing hyperparameters specifically for imbalanced datasets. Additionally, further work could explore how specific dataset characteristics influence the effectiveness of different techniques. Expanding the scope to include other imbalance handling methods or combinations of methods would also provide deeper insights. While our primary analysis focused on the F1-score, results for other metrics are available, allowing for further exploration and custom analyses based on different performance criteria.
In conclusion, our findings emphasize the importance of addressing class imbalance and offer guidance on choosing appropriate techniques based on dataset and model characteristics. Decision Threshold Calibration, with its strong and consistent performance, can serve as a valuable starting point for practitioners dealing with imbalanced datasets, but flexibility and experimentation remain key to achieving the best results.
To support the reproducibility of our study and provide further value to researchers and practitioners, we have made several resources publicly available:
Model Repositories: Implementations of all 15 models used in this study are available in separate repositories. These can be accessed through links provided in the "Models" section of this publication.
Dataset Repository: The 30 datasets used in our study are available in a GitHub repository titled "30 Imbalanced Classification Study Datasets". This repository includes detailed information about each dataset's characteristics and sources.
Results Repository: A comprehensive collection of our study results is available in a GitHub repository titled "Imbalanced Classification Results Analysis". This includes detailed performance metrics and analysis scripts.
Hyperparameters: The hyperparameters used in the experiment are listed in the `hyperparmeters.csv` file in the "Resources" section.
All project work is open-source, encouraging further exploration and extension of our research. We welcome inquiries and feedback from the community. For any questions or discussions related to this study, please contact the authors at contact@readytensor.com.
We encourage researchers and practitioners to utilize these resources to validate our findings, conduct further analyses, or extend this work in new directions.