An adaptive wordpiece language model for learning Chinese word embeddings
Document Type
Conference Proceeding
Publication Date
8-1-2019
Abstract
Word representations are crucial for many natural language processing tasks. Most existing approaches learn contextual information by assigning a distinct vector to each word and pay less attention to morphology, which makes it difficult for them to handle large vocabularies and rare words. In this paper we propose an Adaptive Wordpiece Language Model (AWLM) for learning Chinese word embeddings, inspired by the previous observation that subword units are important for improving the learning of Chinese word representations. Specifically, a novel approach called BPE+ is established to adaptively generate variable-length grams, which breaks the limitation of stroke n-grams. Semantic information extraction is completed in three parts: extraction of morphological information, reinforcement of fine-grained information, and extraction of semantic information. Empirical results on word similarity, word analogy, text classification, and question answering verify that our method significantly outperforms several state-of-the-art methods.
Identifier
85072967510 (Scopus)
ISBN
9781728103556
Publication Title
IEEE International Conference on Automation Science and Engineering
External Full Text Location
https://doi.org/10.1109/COASE.2019.8843151
e-ISSN
2161-8089
ISSN
2161-8070
First Page
812
Last Page
817
Volume
2019-August
Grant
51775385
Fund Ref
National Natural Science Foundation of China
Recommended Citation
Xu, Binchen; Ma, Lu; Zhang, Liang; Li, Haohai; Kang, Qi; and Zhou, Mengchu, "An adaptive wordpiece language model for learning Chinese word embeddings" (2019). Faculty Publications. 7426.
https://digitalcommons.njit.edu/fac_pubs/7426
