Information Science (情报科学), 2021, Vol. 39, Issue 7: 99-107.

• Business Research •

Research on Hierarchical-Feature Text Classification Based on SFM-DCNN

  

  • Online: 2021-07-16 Published: 2021-07-20

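Abstract (translated from the Chinese): 【Purpose/Significance】To classify effectively the large volume of text data generated on the Internet, improve text-processing efficiency, and provide decision advice for enterprise users. 【Method/Process】To address the problems of traditional word-vector feature embedding, namely its inability to capture polysemy, its sparse features, and the difficulty of feature extraction, this paper proposes a multi-channel hierarchical-feature text classification model based on sentence features (SFM-DCNN). First, the model uses BERT sentence-vector modeling to upgrade the feature embedding from traditional word-level embedding to sentence-level embedding, effectively capturing semantic features such as polysemy, word position, and inter-word relations. Second, a multi-channel deep convolutional model is constructed to extract hidden features from the sentence features at multiple levels, yielding features closer to the original semantics. 【Result/Conclusion】The model was validated on three different datasets and compared with related classification methods; SFM-DCNN achieved higher accuracy than the other models, which suggests that the model has some reference value. 【Innovation/Limitation】To address polysemy and feature sparsity in text classification, BERT is used to extract global semantic information and is combined with multi-channel deep convolution to obtain local hierarchical features. However, owing to limits on time and equipment, the model was not further pre-trained, and the experimental datasets are not sufficiently extensive.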

Abstract: 【Purpose/significance】This paper seeks an effective way to classify unstructured text, aiming to improve the efficiency of corporate problem solving and to provide decision advice to enterprise users. 【Method/process】To solve the problems of word-vector feature embedding, such as its inability to capture polysemous words, sparse features, and difficulty in feature extraction, this paper proposes a multi-channel hierarchical feature text classification model based on sentence features (SFM-DCNN). First, the model upgrades feature embedding from traditional word feature embedding to sentence feature embedding through BERT sentence-vector modeling, effectively obtaining semantic features such as word polysemy, word position, and inter-word association. Second, by constructing a multi-channel deep convolution model, the embedded sentence features are reinforced at multiple levels to obtain features closer to the original semantics. 【Result/conclusion】We examined the proposed model on three different text datasets and found that its precision is much higher than that of the traditional methods, indicating that the model has certain reference significance. 【Innovation/limitation】To address polysemy and feature sparsity in text classification, BERT is used to extract global semantic information, and multi-channel deep convolution is used to obtain local hierarchical features. However, owing to time and equipment constraints, the model has not been further pre-trained, and the experimental datasets are not sufficient.
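
The multi-channel, multi-level convolution described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give kernel widths, depth, filter counts, or the pooling scheme, so every value below is an assumption. Random vectors stand in for BERT sentence vectors, and the names `conv_layer`, `max_pool`, `build_channels`, and `hierarchical_features` are hypothetical. Each channel uses a different kernel width, layers are stacked within a channel, and the max-pooled output of every layer is kept so that the final feature vector mixes shallow and deep features.

```python
import random


def conv_layer(seq, filters):
    """Valid 1D convolution followed by ReLU.

    seq:     T vectors of dimension D
    filters: F kernels, each K vectors of dimension D
    returns  (T - K + 1) vectors of dimension F
    """
    k = len(filters[0])
    out = []
    for t in range(len(seq) - k + 1):
        window = seq[t:t + k]
        row = []
        for kernel in filters:
            s = sum(w * x
                    for kw, xw in zip(kernel, window)
                    for w, x in zip(kw, xw))
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out


def max_pool(seq):
    """Max over time for each feature dimension: T x F -> F."""
    return [max(col) for col in zip(*seq)]


def build_channels(rng, dim, widths=(2, 3), depth=2, n_filters=4):
    """One channel per kernel width; each channel stacks `depth` layers.

    Widths, depth, and filter count are illustrative assumptions.
    """
    channels = []
    for width in widths:
        layers, in_dim = [], dim
        for _ in range(depth):
            layers.append([[[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                            for _ in range(width)]
                           for _ in range(n_filters)])
            in_dim = n_filters  # the next layer reads this layer's output
        channels.append(layers)
    return channels


def hierarchical_features(seq, channels):
    """Concatenate the pooled output of every layer in every channel."""
    feats = []
    for layers in channels:
        h = seq
        for filters in layers:
            h = conv_layer(h, filters)
            feats.extend(max_pool(h))  # keep features from each depth level
    return feats


rng = random.Random(0)
# Assumption: 6 embedding vectors of dimension 8 stand in for BERT output.
sentence_vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]
channels = build_channels(rng, dim=8)
features = hierarchical_features(sentence_vectors, channels)
print(len(features))  # 2 channels x 2 layers x 4 filters = 16
```

In a real system the filter weights would be learned and the concatenated features passed to a classifier; here the weights are random and serve only to show the data flow.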