Information Science (情报科学) ›› 2021, Vol. 39 ›› Issue (7): 147-152.

• Doctoral Forum •

Research on Network Public Opinion Topic Identification Based on Multi-Sample Bidirectional Encoder Representations

  

  • Online: 2021-07-16  Published: 2021-07-20

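Summary: 【Purpose/significance】Public opinion topic identification has long been a research hotspot in the public opinion field,
and a rich body of results now exists. When representing public opinion information, existing studies mostly adopt the traditional
bag-of-words model, topic models or word-vector models, which can only give each word a single vector representation. The
traditional models also require word segmentation of the text, so segmentation errors, data sparsity and out-of-vocabulary words
may impair identification performance.【Method/process】This paper builds a network public opinion topic identification model based
on multi-sample bidirectional encoder representations. No word segmentation is needed before training. Overlong text is truncated
by combining its head and tail. Feature embeddings are extracted along three dimensions: character, segment and position. Public
opinion is represented through the self-attention mechanism. Discriminative fine-tuning and multi-sample dropout are used during
training to strengthen generalization and improve identification performance.【Result/conclusion】Experimental results show that the
model performs well on the public opinion topic classification task and can accurately identify public opinion topics without
segmenting the text.【Innovation/limitation】The innovation lies in building a new network topic identification model. The limitation
lies in the complexity of the algorithm, and further parameter tuning and optimization is the focus of subsequent research.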
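
The head-and-tail truncation mentioned above can be illustrated with a minimal sketch. It assumes the text has already been split into characters (no word segmentation), and the function name `truncate_head_tail` and the values `max_len=510` and `head_len=128` are illustrative assumptions: the abstract does not state the split sizes the authors used. The budget of 510 simply leaves room for the two special tokens in a 512-position BERT input.

```python
def truncate_head_tail(tokens, max_len=510, head_len=128):
    """Keep the first `head_len` tokens and the last `max_len - head_len` tokens.

    Sequences that already fit within `max_len` are returned unchanged.
    The default sizes are assumptions for illustration, not the paper's settings.
    """
    if max_len <= 0 or not 0 <= head_len <= max_len:
        raise ValueError("need max_len > 0 and 0 <= head_len <= max_len")
    if len(tokens) <= max_len:
        return list(tokens)
    tail_len = max_len - head_len
    head = list(tokens[:head_len])
    # Slicing from an explicit start index also handles tail_len == 0 correctly.
    tail = list(tokens[len(tokens) - tail_len:])
    return head + tail


# Character-level example: keep 2 leading and 4 trailing characters.
example = truncate_head_tail(list("abcdefghij"), max_len=6, head_len=2)
print("".join(example))  # abghij
```

Keeping both ends rather than only the beginning is one way to retain the opening and the conclusion of a long post, which often carry the topic cues.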

Abstract: 【Purpose/significance】Topic identification has long been a research hotspot in the field of public opinion and has
produced abundant findings. Existing work mostly represents public opinion information with the traditional bag-of-words model,
topic models such as LDA, or word-vector models. These assign each word a single fixed vector, and the traditional models require
word segmentation, so segmentation errors, data sparsity and out-of-vocabulary words may degrade identification
performance.【Method/process】This paper proposes a topic identification model for network public opinion based on multi-sample
Bidirectional Encoder Representations from Transformers (BERT). Text does not need to be segmented before training. Overlong text
is truncated by combining its head and tail. Embedding features are extracted along three dimensions: token (character), segment
and position. Public opinion is represented through the self-attention mechanism. Discriminative fine-tuning and multi-sample
dropout are used during training to strengthen generalization and improve identification performance.【Result/conclusion】Results
show that the proposed model performs well on the public opinion topic classification task and can identify topics accurately
without text segmentation.【Innovation/limitation】The innovation of this article is the construction of a new network topic
identification model. Its limitation is the complexity of the algorithm, and further parameter tuning and optimization is the
focus of future research.
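
As a rough illustration of multi-sample dropout, the sketch below follows the way the technique is commonly described: several independent dropout masks are applied to the same feature vector, each masked copy passes through one shared classifier, and the resulting losses are averaged. It uses plain Python lists in place of a deep-learning framework. All names (`multi_sample_dropout_loss`, `num_samples`, `rate`, `seed`) and the default values are assumptions for illustration, since the abstract does not give the authors' settings or implementation.

```python
import math
import random


def dropout(features, rate, rng):
    """Inverted dropout: zero each feature with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def linear(features, weights, bias):
    """Shared classifier head: one weight row and one bias per class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, using the log-sum-exp trick."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def multi_sample_dropout_loss(features, weights, bias, label,
                              num_samples=5, rate=0.5, seed=0):
    """Average the loss over `num_samples` independent dropout masks.

    `num_samples` and `rate` are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(num_samples):
        dropped = dropout(features, rate, rng)
        losses.append(cross_entropy(linear(dropped, weights, bias), label))
    return sum(losses) / num_samples


# Hypothetical 2-feature, 2-class example.
loss = multi_sample_dropout_loss([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]],
                                 [0.0, 0.0], label=1)
print(loss)
```

Because the encoder output is computed once and only the lightweight classifier head is repeated, the extra masks add little cost per training step, which is the usual motivation given for the technique.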