Survey of word embedding

LIU Jian-wei; GAO Yue

Cite this article: LIU Jian-wei, GAO Yue. Survey of word embedding[J]. Control Theory & Applications, 2022, 39(7): 1171-1193.

Received: 2021-07-27 Revised: 2021-12-23

DOI: 10.7641/CTA.2022.10678

2022,39(7):1171-1193

Keywords: word embedding; neural network language model; cross-lingual; BERT (bidirectional encoder representations); information bottleneck

Funding: Supported by the Science Foundation of China University of Petroleum (Beijing) (2462020YXZZ023).

Author	Affiliation	E-mail
LIU Jian-wei*	China University of Petroleum (Beijing)	liujw@cup.edu.cn
GAO Yue	China University of Petroleum (Beijing)

Abstract

Word embedding is one of the most basic yet important research topics in natural language processing (NLP), and it underlies all downstream higher-level language processing tasks. Early one-hot representations ignored the semantic information of words and often suffered from data sparsity in applications. With the introduction of the neural language model (NLM), words came to be represented as dense, low-dimensional real-valued vectors, which effectively solved the problems of data sparsity and high dimensionality. Word-level embeddings were the original input representation of models based on neural network language models, and many variants have since been proposed from different perspectives. In this survey, we divide word embedding models by the number of languages involved into monolingual and cross-lingual word embedding models. Within the monolingual category, we further divide models by the granularity of their input into character-level, word-level, and phrase-level-and-above word embedding models; models at different granularities suit different application scenarios, and each has its own strengths. These models are then classified again according to whether they take context information into account. Word embedding is also often combined with models from other settings, introducing other modalities or associated information to help learn word embeddings and improve model performance; we therefore also list some joint applications of word embedding models and models from other domains. Having studied these models, we summarize and compare the characteristics of each, and at the end of the paper we give future research directions and prospects for word embedding.
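
The contrast the abstract draws between one-hot and dense low-dimensional representations can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-word vocabulary, the helper names `one_hot` and `cosine`, and the hand-picked 2-dimensional vectors, which merely stand in for embeddings that a neural language model would learn. It is not the method of any model covered by the survey.

```python
import math

# Hypothetical toy vocabulary (illustrative only).
vocab = ["king", "queen", "apple"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a one-hot vector whose length equals the vocabulary size."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked dense vectors (an assumption, not learned values).
dense = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "apple": [0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal, so they carry
# no notion of semantic similarity.
one_hot_sim = cosine(one_hot("king"), one_hot("queen"))

# Dense vectors can place related words closer together.
dense_sim_related = cosine(dense["king"], dense["queen"])
dense_sim_unrelated = cosine(dense["king"], dense["apple"])
```

In this sketch the one-hot vectors grow with the vocabulary and every pair of distinct words has similarity zero, whereas the dense vectors keep a fixed small dimension and let related words score higher than unrelated ones.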