Labeling News Article’s Subject Using Uncertainty Based Active Learning
Document Type
Conference Proceeding
Publication Date
1-1-2021
Abstract
In Natural Language Processing, labeling a text corpus is often an expensive task that requires considerable human effort, whereas unlabeled text corpora in many domains are readily available. For a couple of decades, research has concentrated on algorithms that can label a corpus automatically, minimizing the number of articles that must be labeled manually. Semi-Supervised Learning and Active Learning have shown great promise for labeling articles using a trained model, and both kinds of algorithms have strong theoretical guarantees. This study aims to tag 1183 articles from The New York Times and The Wall Street Journal with their subject (i.e., the primary organization related to the article) using an Active Learning algorithm that combines Random Sampling with Uncertainty Based Querying. This approach is used to train a Naïve Bayes classifier on Bag of Words features. The classifier is then used to tag the 1183 articles, of which only 167 required manual review, a reduction of 85.89% in manual labeling with 78.18% accuracy. To verify the quality of the labeled corpus, an SVM classifier using the same features was trained on it, giving an accuracy of 74.45% on test data.
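The labeling procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `NaiveBayes`, `active_label`, `oracle`, `seed_size`, and `threshold` are hypothetical, and the least-confidence criterion, the single pass over the articles, and the choice to retrain only on human-reviewed labels are assumptions made for illustration. A small random sample is labeled by hand, and each remaining article is sent for manual review only when the classifier's top class probability falls below the threshold.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Bag of Words: lowercase whitespace tokens, word order ignored.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes on word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict_proba(self, text):
        n_docs = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / n_docs)
            for w in tokenize(text):
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log(
                        (self.word_counts[c][w] + 1) / (total + len(self.vocab))
                    )
            log_scores[c] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def active_label(texts, oracle, seed_size=2, threshold=0.8, rng=None):
    """Label every text; return (labels, indices that needed manual review).

    `oracle` stands in for the human annotator (hypothetical callable).
    """
    rng = rng or random.Random(0)
    indices = list(range(len(texts)))
    # Random Sampling: a small seed set is always labeled by hand.
    manual = set(rng.sample(indices, seed_size))
    labels = {i: oracle(texts[i]) for i in manual}
    for i in indices:
        if i in labels:
            continue
        reviewed = sorted(manual)
        model = NaiveBayes().fit(
            [texts[j] for j in reviewed], [labels[j] for j in reviewed]
        )
        probs = model.predict_proba(texts[i])
        best = max(probs, key=probs.get)
        # Uncertainty Based Querying: ask the human when confidence is low,
        # or when fewer than two classes have been seen so far.
        if len(model.classes) < 2 or probs[best] < threshold:
            labels[i] = oracle(texts[i])
            manual.add(i)
        else:
            labels[i] = best
    return labels, manual


if __name__ == "__main__":
    # Toy articles and a keyword oracle, both invented for this sketch.
    articles = [
        "apple unveils new iphone",
        "tesla recalls electric cars",
        "apple iphone sales rise",
        "tesla opens new car factory",
        "iphone maker apple reports earnings",
        "electric car maker tesla cuts prices",
    ]
    annotator = lambda t: "Apple" if "apple" in t else "Tesla"
    labels, manual = active_label(articles, annotator)
    reduction = 1 - len(manual) / len(articles)
    print(f"manual reviews: {len(manual)} of {len(articles)}")
    print(f"reduction in manual labeling: {reduction:.2%}")
```

Raising `threshold` sends more articles to manual review and lowers the risk of wrong automatic labels; lowering it does the opposite. The reduction printed at the end is computed as one minus the fraction of articles reviewed by hand, which is how the sketch interprets the reduction figure reported in the abstract.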
Identifier
85111083932 (Scopus)
ISBN
9783030760625
Publication Title
Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST)
External Full Text Location
https://doi.org/10.1007/978-3-030-76063-2_15
e-ISSN
1867822X
ISSN
18678211
First Page
200
Last Page
208
Volume
372
Recommended Citation
Parekh, Meet and Patel, Yash, "Labeling News Article’s Subject Using Uncertainty Based Active Learning" (2021). Faculty Publications. 4509.
https://digitalcommons.njit.edu/fac_pubs/4509