Improving the Predictive Analytics of Machine-Learning Pipelines for Bridge Infrastructure Asset Management Applications: An Upstream Data Workflow to Address Data Quality Issues in the National Bridge Inventory Database
Document Type
Article
Publication Date
1-1-2024
Abstract
The increasing availability of bridge data from the National Bridge Inventory (NBI) offers a great opportunity to perform predictive analytics (such as bridge deterioration prediction) using machine learning (ML) pipelines to support bridge asset management. However, data quality issues (e.g., outliers and missing values) can significantly affect ML pipelines, requiring upstream tasks to be performed to ensure the validity, applicability, and generalizability of pipelines. Among these tasks, outlier removal and missing value imputation are the most challenging due to a highly laborious process, a lack of data governance, and a mixture of heterogeneous data quality issues and data types. To address this challenge, this paper proposes an upstream workflow for enhancing the downstream predictive analytics of bridge-related ML pipelines. The proposed upstream workflow was developed based on the NBI data collected for all states in the United States, which include a total of 617,084 observations/bridges. Existing bridge domain knowledge from multiple sources (such as the bridge design manual and regulations) was leveraged to remove outliers. Then, this study applied and compared 10 statistical and ML-based data imputation techniques to impute missing values. Statistical analysis and imputation evaluation of NBI data indicated that: (1) 19 and 15 out of the total 38 frequently used features or variables had outliers and missing values, respectively; (2) categorical features are generally more prone to data dropping due to inapplicable values, while numeric features are more subject to outliers; and (3) ML-based data imputation is more suitable than statistical imputation for both numeric and categorical features, especially for features with a high missing rate. The proposed workflow was validated for its ability to improve downstream predictive analytics for bridge deck condition prediction, increasing the balanced accuracy by 6.85%-9.76%.
This paper contributes to the body of knowledge by offering a novel upstream workflow that can be utilized as a benchmark for guiding researchers and bridge engineering practitioners in handling NBI data quality issues to better perform predictive analytics using ML pipelines.
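As a rough illustration of the two upstream steps the abstract describes (rule-based outlier removal followed by imputation of missing values), the following minimal Python sketch may help. It is not the authors' implementation: the field names (`year_built`, `span_length_m`, `deck_width_m`), the plausibility ranges in `RULES`, the toy records, and the simple nearest-neighbor imputer are all assumptions made for illustration, and the abstract does not list the 10 imputation techniques that were actually compared.

```python
from statistics import mean

# Hypothetical plausibility ranges. Real bounds would come from bridge design
# manuals and regulations, which the abstract does not enumerate.
RULES = {
    "year_built": (1800, 2024),
    "span_length_m": (6.0, 2000.0),
}

def remove_outliers(rows, rules=RULES):
    """Keep rows whose non-missing values all fall inside their rule range."""
    def plausible(row):
        return all(
            row.get(col) is None or lo <= row[col] <= hi
            for col, (lo, hi) in rules.items()
        )
    return [r for r in rows if plausible(r)]

def mean_impute(rows, target):
    """Statistical baseline: fill missing target values with the column mean."""
    fill = mean(r[target] for r in rows if r[target] is not None)
    return [dict(r, **{target: fill}) if r[target] is None else dict(r)
            for r in rows]

def knn_impute(rows, target, predictors, k=2):
    """Neighbor-based imputer: average the target over the k closest complete rows.

    Assumes the predictor columns have no missing values and uses unscaled
    squared Euclidean distance, which is a simplification.
    """
    donors = [r for r in rows if r[target] is not None]
    out = []
    for r in rows:
        if r[target] is not None:
            out.append(dict(r))
            continue
        nearest = sorted(
            donors,
            key=lambda d: sum((d[p] - r[p]) ** 2 for p in predictors),
        )[:k]
        out.append(dict(r, **{target: mean(d[target] for d in nearest)}))
    return out

# Toy records (invented for this sketch, not NBI data).
rows = [
    {"year_built": 1960, "span_length_m": 20.0, "deck_width_m": 8.0},
    {"year_built": 1965, "span_length_m": 22.0, "deck_width_m": 8.5},
    {"year_built": 2005, "span_length_m": 60.0, "deck_width_m": 14.0},
    {"year_built": 2010, "span_length_m": 65.0, "deck_width_m": 15.0},
    {"year_built": 1062, "span_length_m": 25.0, "deck_width_m": 9.0},   # implausible year
    {"year_built": 2008, "span_length_m": 62.0, "deck_width_m": None},  # missing value
]

clean = remove_outliers(rows)  # drops the record with year_built = 1062
by_mean = mean_impute(clean, "deck_width_m")[-1]["deck_width_m"]
by_knn = knn_impute(clean, "deck_width_m", ["year_built", "span_length_m"])[-1]["deck_width_m"]
print(len(clean), by_mean, by_knn)  # prints: 5 11.375 14.5
```

In this toy case the neighbor-based estimate follows the two most similar bridges (both recent, long-span structures), while the column mean is pulled toward the older, narrower ones. This only illustrates why a model-based imputer can outperform a global statistic; it does not reproduce the paper's evaluation.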
Identifier
85175302540 (Scopus)
Publication Title
Journal of Bridge Engineering
External Full Text Location
https://doi.org/10.1061/JBENF2.BEENG-6012
e-ISSN
19435592
ISSN
10840702
Issue
1
Volume
29
Recommended Citation
Hu, Xi and Assaad, Rayan H., "Improving the Predictive Analytics of Machine-Learning Pipelines for Bridge Infrastructure Asset Management Applications: An Upstream Data Workflow to Address Data Quality Issues in the National Bridge Inventory Database" (2024). Faculty Publications. 1141.
https://digitalcommons.njit.edu/fac_pubs/1141