情报科学 (Information Science) ›› 2022, Vol. 40 ›› Issue (6): 90-97.

• Professional Research •

News Headline Classification Fusing Contextual Features and BERT Word Embeddings

  

  • Online: 2022-06-01 Published: 2022-06-12

Abstract: [Purpose/significance] With the growth of social media, the volume of news has surged and the monitoring and handling of public opinion has become increasingly important. Efficient and accurate identification of public opinion news can help the relevant departments collect and track information on emergencies and respond in time, reducing the impact of public opinion on society. This paper proposes a news headline text classification model that combines BERT, TEXTCNN and BILSTM and takes full account of word embedding information, text features and context information in order to improve the accuracy of news headline category recognition. [Method/process] The news headline text vectors generated by BERT are fed into TEXTCNN to extract features, the TEXTCNN output is fed into BILSTM to capture the contextual information of the headline, and softmax is used to determine the classification result. [Result/conclusion] The proposed classification model, which combines the language-model-based BERT, the word-vector-based TEXTCNN and the context-based BILSTM, scores above 0.92 on accuracy, precision, recall and F1, generalizes well, and outperforms traditional text classification models. [Innovation/limitation] The model uses BERT for word embedding while also extracting features and capturing contextual semantics, and it performs well at identifying news categories. However, its large number of parameters and high vector dimensionality place high demands on training hardware, and the dataset contains only 10 categories; no experiments were run on data with more categories or finer-grained categories.
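
The abstract reports four evaluation measures (accuracy, precision, recall and F1) without stating how they are averaged across the 10 categories. As a minimal sketch, assuming macro averaging (an assumption, not something the abstract confirms), the measures could be computed as follows. The function name `classification_metrics` and the example labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1.

    Macro averaging is an assumption made for this sketch; the abstract
    does not say which averaging scheme the authors used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    def safe_div(num, den):
        # Define the ratio as 0.0 when a class has no predictions or no samples.
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        precision = safe_div(tp[label], tp[label] + fp[label])
        recall = safe_div(tp[label], tp[label] + fn[label])
        f1 = safe_div(2 * precision * recall, precision + recall)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Hypothetical headline categories, not taken from the paper's dataset.
    truth = ["sports", "sports", "finance", "finance", "tech", "tech"]
    preds = ["sports", "finance", "finance", "finance", "tech", "sports"]
    for name, value in classification_metrics(truth, preds).items():
        print(f"{name}: {value:.4f}")
```

Macro averaging weights every category equally regardless of how many headlines it contains, so it is one reasonable reading of a single reported figure per measure on a balanced dataset; a weighted or micro average would give different numbers on imbalanced data.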