SmokeOut: An approach for testing clustering implementations
Document Type
Conference Proceeding
Publication Date
4-1-2019
Abstract
Clustering is a key Machine Learning technique, used in many high-stakes domains from medicine to self-driving cars. Many clustering algorithms have been proposed, and these algorithms have been implemented in many toolkits. Clustering users assume that clustering implementations are correct, reliable, and, for a given algorithm, interchangeable. We challenge these assumptions. We introduce SmokeOut, an approach and tool that pits clustering implementations against each other (and against themselves) while controlling for algorithm and dataset, to find datasets where clustering outcomes differ when they shouldn't, and to measure this difference. We ran SmokeOut on 7 clustering algorithms (3 deterministic and 4 nondeterministic) implemented in 7 widely used toolkits, in a variety of scenarios, on the Penn Machine Learning Benchmark (162 datasets). SmokeOut revealed that clustering implementations are fragile: on a given input dataset and using a given clustering algorithm, clustering outcomes and accuracy vary widely between (1) successive runs of the same toolkit; (2) different input parameters for that toolkit; (3) different toolkits.
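The run-against-run comparison described in the abstract can be illustrated with a minimal sketch. It is not SmokeOut's code: the toy `kmeans` function stands in for a toolkit implementation, the adjusted Rand index is assumed here as the agreement measure (the abstract does not name one), and the names `kmeans`, `adjusted_rand_index`, and `run_to_run_agreement` are hypothetical.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same points.

    1.0 means the partitions are identical up to a renaming of labels;
    values near 0 mean agreement is no better than chance.
    """
    n = len(labels_a)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def kmeans(points, k, seed, iterations=20):
    """Toy Lloyd's k-means with seeded random initial centroids.

    Assumption: this stands in for one toolkit's nondeterministic implementation.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def run_to_run_agreement(points, k, seeds):
    """Cluster once per seed; return the ARI of every later run against the first."""
    runs = [kmeans(points, k, seed) for seed in seeds]
    return [adjusted_rand_index(runs[0], other) for other in runs[1:]]
```

With the algorithm and dataset held fixed, any agreement score below 1.0 marks a pair of runs whose outcomes differ. The same comparison could be applied across parameter settings or across toolkits by swapping what produces each labeling.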
Identifier
85067108534 (Scopus)
ISBN
[9781728117355]
Publication Title
Proceedings - 2019 IEEE 12th International Conference on Software Testing, Verification and Validation, ICST 2019
External Full Text Location
https://doi.org/10.1109/ICST.2019.00057
First Page
473
Last Page
480
Grant
W911NF-13-2-0045
Fund Ref
National Science Foundation
Recommended Citation
Musco, Vincenzo; Yin, Xin; and Neamtiu, Iulian, "SmokeOut: An approach for testing clustering implementations" (2019). Faculty Publications. 7676.
https://digitalcommons.njit.edu/fac_pubs/7676
