情报科学 (Information Science) ›› 2022, Vol. 40 ›› Issue (3): 99-108.

• Business Research •

A Keyword Extraction Method Based on Semantic Clustering

  

  • Online:2022-03-01 Published:2022-03-08

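Abstract (translated from the Chinese): 【Purpose/significance】Keyword extraction is essentially the task of finding the key terms that express a document's core semantic information, so analyzing semantics rather than surface words better matches practical needs. Building on the TextRank word graph model, this paper analyzes semantics in place of words and proposes a keyword extraction method based on semantic clustering.【Method/process】First, word vectors trained with HowNet sememe information are clustered so that words with similar meanings are grouped together and each word is assigned a semantic category. Then, the window co-occurrence frequency of the semantic categories to which words belong is used as the transition probability between words to compute node scores. Finally, the TF-IDF value and the node score are combined by weighted summation to correct the keyword extraction results.【Result/conclusion】Judged by the overall keyword extraction results, the proposed method improves extraction performance to some extent: compared with the TextRank algorithm, precision P, recall R, and F-value improve by 12.66%, 13.77%, and 13.16%, respectively.【Innovation/limitation】The innovation of this paper lies in using semantics in place of words and analyzing the relevance network at the semantic level. It is also the first to introduce word vectors that incorporate HowNet sememe information into keyword extraction. The limitation is that the method depends on HowNet information and is therefore applicable only to Chinese text.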

Abstract: 【Purpose/significance】The essence of keyword extraction is to find the key words that express the core semantic information of a document, so analyzing semantics instead of words is more in line with actual needs. Based on the TextRank word graph model, this paper proposes a keyword extraction method based on semantic clustering that analyzes semantics in place of words.【Method/process】First, word embeddings trained with HowNet sememe information are clustered to gather words with similar meanings and to obtain the corresponding semantic category for each word. Then, the window co-occurrence frequency of the semantic categories of the words is taken as the transition probability between words to calculate the node score. Finally, the TF-IDF value and the node score are combined by weighted summation to correct the keyword extraction results.【Result/conclusion】Overall, the proposed keyword extraction method improves extraction performance to some extent. Compared with the TextRank algorithm, the precision P, recall R, and F values are improved by 12.66%, 13.77%, and 13.16%, respectively.【Innovation/limitation】This paper analyzes the relevance network at the semantic level and measures the weight of words through semantic clustering. This idea can be combined with existing word-based methods by replacing words with their semantics. This paper also introduces word vectors that fuse HowNet sememe information into keyword extraction for the first time, which can serve as a reference for other natural language processing work. Because the method relies on the Chinese knowledge base HowNet, it is suitable only for Chinese text extraction and does not generalize across languages.
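
The three steps in the method description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`window_pairs`, `semantic_textrank`, `fuse`), the window size, the damping factor, and the weight `alpha` are all hypothetical. The cluster labels in `word2cat` are written by hand here and stand in for the output of clustering HowNet-enhanced word vectors. Two further assumptions are made because the abstract does not specify them: words are linked when they share a window, with the edge weighted by the co-occurrence count of their semantic categories, and both scores are scaled by their maximum before the weighted sum.

```python
from collections import Counter, defaultdict


def window_pairs(tokens, window):
    """Yield each pair of distinct words that falls inside one sliding window."""
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if v != w:
                yield w, v


def semantic_textrank(tokens, word2cat, window=3, damping=0.85, iters=50):
    """Score words with a TextRank-style iteration over category-weighted edges."""
    cat_counts = Counter()        # symmetric category-level co-occurrence counts
    neighbors = defaultdict(set)  # word-level adjacency (assumed linking rule)
    for w, v in window_pairs(tokens, window):
        a, b = word2cat[w], word2cat[v]
        cat_counts[(a, b)] += 1
        cat_counts[(b, a)] += 1
        neighbors[w].add(v)
        neighbors[v].add(w)

    def weight(w, v):
        # Edge weight comes from the categories, not from the words themselves.
        return cat_counts[(word2cat[w], word2cat[v])]

    # Normalizing by the outgoing weight turns counts into transition probabilities.
    out_sum = {w: sum(weight(w, v) for v in nbrs) for w, nbrs in neighbors.items()}
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        scores = {
            w: (1 - damping)
            + damping * sum(weight(v, w) / out_sum[v] * scores[v] for v in nbrs)
            for w, nbrs in neighbors.items()
        }
    return scores


def fuse(node_scores, tfidf, alpha=0.5):
    """Weighted sum of max-scaled node scores and TF-IDF values (assumed scaling)."""
    def scale(d):
        top = max(d.values()) or 1.0
        return {k: x / top for k, x in d.items()}

    ns = scale(node_scores)
    tf = scale({w: tfidf.get(w, 0.0) for w in node_scores})
    return {w: alpha * ns[w] + (1 - alpha) * tf[w] for w in node_scores}


# Toy input: a tokenized document, hand-made cluster labels, and made-up TF-IDF values.
tokens = ["semantic", "clustering", "keyword", "extraction",
          "semantic", "graph", "keyword", "ranking"]
word2cat = {"semantic": 0, "clustering": 1, "keyword": 2,
            "extraction": 3, "graph": 1, "ranking": 3}
tfidf = {"semantic": 0.9, "clustering": 0.6, "keyword": 0.8,
         "extraction": 0.5, "graph": 0.3, "ranking": 0.4}

scores = semantic_textrank(tokens, word2cat)
fused = fuse(scores, tfidf)
ranked = sorted(fused, key=fused.get, reverse=True)
print(ranked[:3])
```

Because edge weights are shared by all words in the same category pair, a rare word can inherit importance from frequent words with similar meaning, which is the intuition behind replacing words with semantics in the graph.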