Faculty Publications

Data-Centric Explainable Debiasing for Improving Fairness in Pre-trained Language Models

Yingji Li, Jilin University
Mengnan Du, Ying Wu College of Computing
Rui Song, Jilin University
Xin Wang, Jilin University
Ying Wang, Jilin University

Document Type

Conference Proceeding

Publication Date

1-1-2024

Abstract

Human-like social bias in pre-trained language models (PLMs) on downstream tasks has attracted increasing attention. Potential flaws in the training data are the main factor behind unfairness in PLMs. Existing data-centric debiasing strategies mainly use explicit bias words (defined as sensitive attribute words specific to demographic groups) for counterfactual data augmentation to balance the training data. However, they do not consider implicit bias words, which may be associated with explicit bias words in complex data distributions; this omission indirectly harms the fairness of PLMs. To this end, we propose a Data-Centric Debiasing method (named Data-Debias), which uses an explainability method to search for implicit bias words that assist in debiasing PLMs. Specifically, we compute the feature attributions of all tokens using the Integrated Gradients method and treat the tokens that have a large impact on the model's decision as implicit bias words. To make the search results more precise, we iteratively train a biased model, amplifying the bias with each iteration. Finally, we use the implicit bias words found in the last iteration to assist in debiasing PLMs. Extensive experiments debiasing multiple PLMs on three different classification tasks demonstrate that Data-Debias achieves state-of-the-art debiasing performance and strong generalization while maintaining predictive ability.
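The attribution step described in the abstract can be illustrated with a small sketch. It is not the authors' Data-Debias implementation: the classifier is a hypothetical toy model (a sigmoid over a weight vector dotted with the mean token embedding) whose gradient is written by hand, the embeddings, weights, and word pairs are invented for illustration, and the iterative training of a biased model is omitted. `integrated_gradients` approximates Integrated Gradients with an all-zero baseline and a midpoint Riemann sum, `top_k_implicit` keeps the highest-attribution tokens that are not explicit bias words, and `counterfactual_augment` performs the explicit-word swap used by existing counterfactual data augmentation strategies. All function and variable names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set

# Hypothetical explicit bias word pairs (illustrative only, not the paper's lists).
GENDER_PAIRS = {"he": "she", "she": "he", "man": "woman",
                "woman": "man", "king": "queen", "queen": "king"}


def counterfactual_augment(tokens: Sequence[str], pairs: Dict[str, str]) -> List[str]:
    """Swap each explicit bias word for its counterpart (counterfactual augmentation)."""
    return [pairs.get(t, t) for t in tokens]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def toy_model(embs: List[List[float]], w: List[float]) -> float:
    """Hypothetical stand-in classifier: sigmoid(w . mean of token embeddings)."""
    s = sum(sum(wj * ej for wj, ej in zip(w, e)) for e in embs) / len(embs)
    return sigmoid(s)


def toy_grad(embs: List[List[float]], w: List[float]) -> List[List[float]]:
    """Analytic gradient of toy_model with respect to each token embedding."""
    p = toy_model(embs, w)
    scale = p * (1.0 - p) / len(embs)
    return [[scale * wj for wj in w] for _ in embs]


def integrated_gradients(embs: List[List[float]], w: List[float],
                         steps: int = 200) -> List[float]:
    """Per-token IG scores with an all-zero baseline (midpoint Riemann sum)."""
    dim = len(w)
    avg_grad = [[0.0] * dim for _ in embs]
    for k in range(steps):
        alpha = (k + 0.5) / steps
        scaled = [[alpha * x for x in e] for e in embs]  # point on the straight path
        g = toy_grad(scaled, w)
        for i in range(len(embs)):
            for j in range(dim):
                avg_grad[i][j] += g[i][j] / steps
    # IG_i = sum_j (x_ij - baseline_ij) * avg_grad_ij, with baseline = 0.
    return [sum(x * a for x, a in zip(e, ag)) for e, ag in zip(embs, avg_grad)]


def top_k_implicit(tokens: Sequence[str], scores: Sequence[float],
                   explicit: Set[str], k: int = 2) -> List[str]:
    """Rank tokens by |attribution| and keep the top k that are not explicit bias words."""
    ranked = sorted(zip(tokens, scores), key=lambda p: abs(p[1]), reverse=True)
    picked: List[str] = []
    for tok, _ in ranked:
        if len(picked) >= k:
            break
        if tok not in explicit and tok not in picked:
            picked.append(tok)
    return picked


# Usage with made-up 2-d embeddings and weights.
tokens = ["she", "is", "a", "nurse"]
emb = {"she": [1.0, 0.0], "is": [0.0, 0.1], "a": [0.1, 0.0], "nurse": [0.9, 0.5]}
w = [2.0, 1.0]
embs = [emb[t] for t in tokens]
scores = integrated_gradients(embs, w)
print(top_k_implicit(tokens, scores, set(GENDER_PAIRS), k=1))  # ['nurse']
print(counterfactual_augment(tokens, GENDER_PAIRS))  # ['he', 'is', 'a', 'nurse']
```

Because Integrated Gradients satisfies completeness, the per-token scores should sum (up to discretization error) to the difference between the model output on the input and on the baseline, which is a useful sanity check when replacing the toy model with a real PLM.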

Identifier

85205292363 (Scopus)

ISBN

9798891760998

Publication Title

Proceedings of the Annual Meeting of the Association for Computational Linguistics

External Full Text Location

https://doi.org/10.18653/v1/2024.findings-acl.226

ISSN

0736-587X

First Page

3773

Last Page

3786

Grant

20240402067GH

Fund Ref

International Science and Technology Cooperation Program of Jiangsu Province

Recommended Citation

Li, Yingji; Du, Mengnan; Song, Rui; Wang, Xin; and Wang, Ying, "Data-Centric Explainable Debiasing for Improving Fairness in Pre-trained Language Models" (2024). Faculty Publications. 859.
https://digitalcommons.njit.edu/fac_pubs/859

This document is currently not available here.


DOI

10.18653/v1/2024.findings-acl.226
