情报科学 (Information Science), 2021, Vol. 39, Issue (8): 156-163.

• Doctoral Forum •

The Influence of Keyword Frequency and Semantic Features on the Clustering of Scientific Literature

  

  • Online: 2021-08-01  Published: 2021-08-05

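Summary: [Purpose/Significance] This paper examines keyword-based clustering of scientific literature and asks two questions: how clustering results differ when keywords with different characteristics are used, and how keywords should be selected by characteristic to improve clustering. [Method/Process] Control groups were set up according to keyword frequency and semantic type for an empirical study, observing their effect on clustering density and on how well the semantics of the literature are represented. [Result/Conclusion] Clustering documents using only keywords that are ultra-high-frequency, sub-high-frequency, research-topic, or scope-limiting yields a reasonably appropriate clustering density. Ultra-high-frequency characteristics are usually reflected in the other frequency bands as well. Sub-high-frequency words can reflect the characteristics of keywords in several frequency bands at once, but they do not fully represent the characteristics of mid-frequency words. Clustering with keywords of different semantic types kept separate works better than clustering with them combined, and keywords of different semantic types are mutually exclusive. [Innovation/Limitation] This paper finds that, when document clustering is based on co-occurrence relationships between keywords, selecting only sub-high-frequency keywords or only keywords of a single semantic category gives good results. It does not, however, investigate the semantic structural relationships between keywords any further.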

Abstract: [Purpose/significance] This paper discusses keyword-based clustering of scientific literature, addressing two questions: how the clustering effect differs when keywords with different characteristics are used, and how to select keywords by characteristic to achieve better clustering results.
[Method/process] Control groups were set up according to keyword frequency and semantic type for an empirical study of literature clustering, in order to observe the impact of keyword characteristics on clustering density and on semantic representation.
[Result/conclusion] An appropriate clustering density can be obtained by clustering with only ultra-high-frequency, sub-high-frequency, research-topic, or limited-range keywords. Ultra-high-frequency characteristics are usually reflected in the other frequency bands. Sub-high-frequency words can simultaneously reflect the characteristics of keywords of different frequencies, but they do not express the characteristics of mid-frequency words comprehensively enough. Clustering documents with keywords of different semantic types kept separate works better than combining keywords of multiple semantic types, and keywords of different semantic types are mutually exclusive.
[Innovation/limitation] This paper finds that, when document clustering is based on the co-occurrence relationship between keywords, it is effective to select only sub-high-frequency keywords or only keywords of a single semantic category. However, further research on the semantic structure relationships between keywords is lacking.
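
The frequency-band comparison described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the four-document corpus, the band cut-offs, and the use of the share of linked document pairs as a stand-in for "clustering density". The abstract does not state the actual thresholds, data, or density measure used in the paper.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): document id -> keywords.
DOCS = {
    "d1": ["clustering", "keywords", "co-occurrence"],
    "d2": ["clustering", "keywords", "semantics"],
    "d3": ["clustering", "keywords", "bibliometrics"],
    "d4": ["clustering", "semantics", "ontology"],
}

# Hypothetical document-frequency cut-offs, checked from highest to lowest.
# The abstract does not give the real band boundaries.
BANDS = [("ultra_high", 4), ("sub_high", 3), ("mid", 2), ("low", 1)]


def band_keywords(docs):
    """Group keywords into frequency bands using the assumed cut-offs."""
    # Count the number of documents each keyword appears in.
    freq = Counter(kw for kws in docs.values() for kw in set(kws))
    grouped = {name: set() for name, _ in BANDS}
    for kw, n in freq.items():
        for name, cutoff in BANDS:
            if n >= cutoff:
                grouped[name].add(kw)
                break
    return grouped


def link_density(docs, selected):
    """Share of document pairs that have at least one selected keyword in common.

    This is an assumed proxy for clustering density, not the paper's measure.
    """
    pairs = list(combinations(sorted(docs), 2))
    linked = sum(1 for a, b in pairs if set(docs[a]) & set(docs[b]) & selected)
    return linked / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    grouped = band_keywords(DOCS)
    for name, _ in BANDS:
        density = link_density(DOCS, grouped[name])
        print(name, sorted(grouped[name]), round(density, 2))
```

In this toy corpus the single ultra-high-frequency keyword links every pair of documents, while the low-frequency keywords link none. That is one way to picture why a middle band might give a more usable density, though the sketch makes no claim about the paper's actual results.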