Information Science (情报科学) ›› 2021, Vol. 39 ›› Issue (1): 135-141.

• Business Research •

Research on the Selection of New Thesaurus Terms Based on a Time Series Clustering Algorithm

  

  • Online: 2021-01-01 Published: 2021-01-25

Abstract: [Purpose/significance] To keep the term coverage of a thesaurus complete, new terms that have appeared in a field but are not yet included must be added to the thesaurus promptly. Combining the temporal and document-frequency characteristics of candidate words, and exploring the distribution of new terms from a time series perspective to guide their selection, is a question worth studying.
[Method/process] The paper studies the time series formed by the document frequency of each word. The series are preprocessed by normalizing word frequency and equalizing their length in time, and the k-means clustering algorithm is then used to cluster candidate words by the trend of their time series. The trend patterns of terms and non-terms are examined, and from them the paper summarizes the trend characteristics that a new term should satisfy.
[Result/conclusion] The clustering study concludes that new terms generally show a growth trend. In the empirical study, candidate words in a growth state were selected and then judged by experts. The results show that the method can effectively pick out, from the candidate vocabulary, new terms that can be added to the thesaurus, and that it achieves relatively high accuracy.
[Innovation/limitation] The innovation is that the selection of new thesaurus terms considers both change over time and document frequency. Because of limits on the scale of data processing, the empirical study counted only the frequency of keywords in papers.
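The preprocessing and clustering steps named in the Method/process part of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which normalization, length-equalization rule, or cluster count was used, so the min-max scaling, the left zero-padding, `k = 2`, the slope-based test for a "growing" cluster, and all word names and frequency values below are assumptions made only for illustration.

```python
import random

def normalize(series):
    """Min-max scale one frequency series to [0, 1] (assumed normalization)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]

def equalize_length(series, length):
    """Left-pad with zeros so every series covers `length` periods (assumed rule)."""
    return [0.0] * (length - len(series)) + list(series)

def slope(series):
    """Least-squares slope of a series against its time index 0, 1, 2, ..."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means on equal-length series; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each series goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical yearly document frequencies of four candidate words.
raw = {
    "term_a": [1, 2, 4, 7, 11, 16],   # rising
    "term_b": [2, 3, 5, 9, 14],       # rising, one period shorter
    "term_c": [15, 12, 8, 5, 3, 1],   # declining
    "term_d": [20, 14, 9, 6, 2, 1],   # declining
}

words = list(raw)
length = max(len(s) for s in raw.values())
points = [equalize_length(normalize(raw[w]), length) for w in words]

labels, centroids = kmeans(points, k=2)

# Keep the words whose cluster centroid trends upward.
growing = {c for c, centroid in enumerate(centroids) if slope(centroid) > 0}
selected = sorted(w for w, lab in zip(words, labels) if lab in growing)
print(selected)
```

In this toy data the two rising series fall into one cluster and the two declining series into the other, so only the rising words are kept as candidates for expert review. On real data, the choice of `k` and of the growth criterion would need to be tuned and validated, as the paper does through expert judgment.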