Information Science (情报科学), 2024, Vol. 42, Issue (4): 36-42.

• Theoretical Research •

Research on the Construction of a Topic Generation Model Based on Chinese Text Category Information

  

  • Publication date: 2024-04-05    Release date: 2024-06-08

Abstract:

【Purpose/significance】 To address the insufficient semantic description and weak semantic coherence of topics identified by the traditional LDA model, this paper integrates text category information into the LDA model to form a new topic generation model based on Chinese text category information, the CLCI-LDA model, which offers a new tool for text analysis and knowledge discovery in data mining.

【Method/process】 To extract topics with the CLCI-LDA model, the deep-learning sentence-vector model Sentence-BERT first converts each text into a sentence embedding, which is concatenated with the document-topic vector generated by the LDA model to enrich the semantics and relatedness of the text vector. The K-means clustering algorithm then clusters the texts to obtain their category information. Finally, the high-frequency keywords in each cluster are selected by topic-word frequency to condense the topics.

【Result/conclusion】 A literature topic extraction experiment on China's "smart library" research field compared the CLCI-LDA model with the traditional LDA model. The results indicate that the CLCI-LDA model is better at obtaining topic words that carry semantic information, and that its topic coherence metric is better than that of the traditional LDA model.

【Innovation/limitation】 Compared with the traditional LDA model, the CLCI-LDA model has advantages in the depth of text semantic representation and the soundness of topic condensation. However, its parameter tuning remains inadequate and its depth of semantic understanding needs further improvement; in addition, the generalizability of the CLCI-LDA model has yet to be tested.
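
The three method steps described in the abstract (concatenating a sentence embedding with an LDA document-topic vector, clustering with K-means, and taking the most frequent words in each cluster) can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' implementation. The function names `concat_features`, `kmeans`, and `top_keywords` are hypothetical, the input vectors are assumed to be precomputed stand-ins for Sentence-BERT embeddings and LDA topic distributions, the K-means seeding (first k points) is a simplification chosen for determinism, and the keyword step counts raw document tokens rather than LDA topic words.

```python
from collections import Counter
import math


def concat_features(sentence_vecs, topic_vecs):
    """Join each document's sentence embedding with its document-topic vector."""
    if len(sentence_vecs) != len(topic_vecs):
        raise ValueError("need one topic vector per sentence embedding")
    return [list(s) + list(t) for s, t in zip(sentence_vecs, topic_vecs)]


def kmeans(points, k, max_iters=100):
    """Plain Lloyd's K-means; returns one cluster label per point.

    Assumption: centers start at the first k points, which keeps the sketch
    deterministic but is weaker than k-means++ style seeding.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def top_keywords(doc_tokens, labels, n=5):
    """Return the n most frequent tokens in each cluster as {label: [tokens]}."""
    counts = {}
    for tokens, lab in zip(doc_tokens, labels):
        counts.setdefault(lab, Counter()).update(tokens)
    return {lab: [w for w, _ in c.most_common(n)] for lab, c in counts.items()}


if __name__ == "__main__":
    # Toy stand-ins: 2-dimensional "sentence embeddings" and 2-topic
    # "LDA document-topic vectors" for four already-tokenized documents.
    sentence_vecs = [[0.9, 0.1], [0.1, 0.8], [0.8, 0.2], [0.2, 0.9]]
    topic_vecs = [[0.7, 0.3], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
    doc_tokens = [
        ["smart", "library", "service"],
        ["topic", "model", "lda"],
        ["library", "service", "reader"],
        ["lda", "topic", "coherence"],
    ]
    features = concat_features(sentence_vecs, topic_vecs)
    labels = kmeans(features, k=2)
    print(labels)
    print(top_keywords(doc_tokens, labels, n=2))
```

In a fuller pipeline the toy vectors would be replaced by real Sentence-BERT embeddings and LDA document-topic distributions, and the two parts would usually be scaled before concatenation so that neither dominates the Euclidean distance. The abstract does not say whether the authors apply such a weighting.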