Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays!
Document Type
Conference Proceeding
Publication Date
1-1-2023
Abstract
Recent progress in Deep Learning (DL) has sparked interest in using DL to detect software vulnerabilities automatically, and DL has shown promising results at detecting vulnerabilities. However, one prominent and practical issue for vulnerability detection is data imbalance. A prior study observed that the performance of state-of-the-art (SOTA) DL-based vulnerability detection (DLVD) approaches drops precipitously on real-world imbalanced data, with a 73% drop in F1-score on average across the studied approaches. Such a significant performance drop can rule out the practical use of any DLVD approach. Data sampling is effective in alleviating data imbalance for machine learning models, as has been demonstrated in various software engineering tasks. Therefore, in this work, we conducted a systematic and extensive study to assess the impact of data sampling on the data imbalance problem in DLVD from two aspects: i) the effectiveness of DLVD, and ii) the ability of DLVD to reason correctly (making a decision based on real vulnerable statements). We found that, in general, oversampling outperforms undersampling, and sampling on raw data outperforms sampling on latent space; random oversampling on raw data typically performs the best among all studied techniques (including the advanced ones, SMOTE and OSS). Surprisingly, OSS does not help alleviate the data imbalance issue in DLVD. If recall is the priority, random undersampling is the best choice. Random oversampling on raw data also improves the ability of DLVD approaches to learn real vulnerable patterns. However, for a significant portion of cases (at least 33% in our datasets), DLVD approaches cannot base their predictions on real vulnerable statements. We provide actionable suggestions and a roadmap for practitioners and researchers.
Identifier
85171735470 (Scopus)
ISBN
9781665457019
Publication Title
Proceedings - International Conference on Software Engineering
External Full Text Location
https://doi.org/10.1109/ICSE48619.2023.00192
ISSN
0270-5257
First Page
2287
Last Page
2298
Recommended Citation
Yang, Xu; Wang, Shaowei; Li, Yi; and Wang, Shaohua, "Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays!" (2023). Faculty Publications. 2043.
https://digitalcommons.njit.edu/fac_pubs/2043