Faculty Publications

MGEL: Multigrained Representation Analysis and Ensemble Learning for Text Moderation

Fei Tan, Yahoo Research Labs
Changwei Hu, Yahoo Research Labs
Yifan Hu, Yahoo Research Labs
Kevin Yen, Yahoo Research Labs
Zhi Wei, Department of Computer Science
Aasish Pappu, Spotify USA Inc
Serim Park, Twitter, Inc.
Keqian Li, Yahoo Research Labs

Document Type

Article

Publication Date

10-1-2023

Abstract

In this work, we describe our efforts to address two challenges that popular text classification methods face when applied to text moderation: the representation of multibyte characters and word obfuscation. Specifically, we develop a multihot byte-level scheme that significantly reduces the dimensionality of one-hot character-level encoding, which is inflated by the large number of instance-scarce non-ASCII characters. In addition, we introduce a simple yet effective weighting approach for fusing n-gram features to strengthen classical logistic regression. Surprisingly, it substantially outperforms well-tuned representative neural networks. Continuing this effort toward text moderation, we analyze the current state-of-the-art (SOTA) algorithm, bidirectional encoder representations from transformers (BERT), which works well for context understanding but performs poorly on intentional word obfuscations. To resolve this, we develop an enhanced variant that remedies the drawback by integrating byte and character decomposition. Our comprehensive experiments show that it advances SOTA performance on the largest abusive language datasets. Our work offers a feasible and effective framework for tackling word obfuscation.
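The abstract does not spell out the multihot byte-level scheme, so the following is only a minimal sketch of one plausible reading: each character is mapped to a fixed 256-dimensional 0/1 vector with a 1 at every byte value in its UTF-8 encoding, instead of a one-hot vector over the full character vocabulary. The function names `multihot_byte_vector` and `encode_text` are hypothetical and are not taken from the paper.

```python
from typing import List

BYTE_DIM = 256  # number of possible byte values


def multihot_byte_vector(char: str) -> List[int]:
    """Assumed encoding: set a 1 at each UTF-8 byte value of `char`.

    This is an illustrative guess at a "multihot byte-level" code,
    not the paper's confirmed definition.
    """
    vec = [0] * BYTE_DIM
    for byte_value in char.encode("utf-8"):
        vec[byte_value] = 1
    return vec


def encode_text(text: str) -> List[List[int]]:
    """Encode a string as one 256-dim multihot vector per character."""
    return [multihot_byte_vector(ch) for ch in text]


if __name__ == "__main__":
    # An ASCII character occupies one byte, so one position is set.
    print(sum(multihot_byte_vector("a")))
    # A non-ASCII character such as "é" spans several UTF-8 bytes,
    # so several positions are set in the same 256-dim vector.
    print(sum(multihot_byte_vector("é")))
```

Under this assumption the per-character dimension stays at 256 no matter how many distinct non-ASCII characters appear in the corpus, which is the kind of dimensionality reduction the abstract describes.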

Identifier

85124201486 (Scopus)

Publication Title

IEEE Transactions on Neural Networks and Learning Systems

External Full Text Location

https://doi.org/10.1109/TNNLS.2021.3137045

e-ISSN

21622388

ISSN

2162237X

PubMed ID

35113788

First Page

7014

Last Page

7023

Recommended Citation

Tan, Fei; Hu, Changwei; Hu, Yifan; Yen, Kevin; Wei, Zhi; Pappu, Aasish; Park, Serim; and Li, Keqian, "MGEL: Multigrained Representation Analysis and Ensemble Learning for Text Moderation" (2023). Faculty Publications. 1417.
https://digitalcommons.njit.edu/fac_pubs/1417

DOI

10.1109/TNNLS.2021.3137045
