Information Science (情报科学) ›› 2025, Vol. 43 ›› Issue (7): 172-181.

• Doctoral Forum •

Predicting Paper Citations by Fusing Early Formal and Semantic Features

  

  • Online:2025-07-05 Published:2025-10-16

Abstract:

【Purpose/significance】This study aims to identify potentially high-impact papers by predicting the early citation counts of
academic articles. Earlier studies considered factors relating to the document, its authors, and the journal, but they often
overlooked the textual semantic information in paper metadata and were hampered by the long time lag of citation data.
【Method/process】This study collected data from ten Chinese journals in the field of information resource management published
from 2008 to 2017. We constructed 16 formal features of the papers and used pre-trained language models to build semantic
features from paper metadata in order to improve predictive performance. We also introduced a deep forest model to predict
citation counts. 【Result/conclusion】The results show that the deep forest model generally outperforms the other algorithms
tested in predicting citation counts. In the baseline experiments, the mean F1 of deep forest is 2.65% higher than the mean of
the other methods. After the semantic features of the papers are incorporated, the citation prediction model improves
noticeably, with F1 exceeding the average of the other models by 1.39%. Semantic features yield a stable performance gain
across models: the mean F1 is 3.4% higher than the mean of the other methods in scenario one and 1.86% higher in scenario two.
【Innovation/limitation】This study fuses semantic features with traditional bibliometric features and uses a deep forest
algorithm to identify the potential academic impact of papers early after publication. Because the sample consists of academic
papers from Chinese library and information science, the findings may not generalize to all scientific fields.
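
The comparisons reported in the abstract ("mean F1 higher than the mean of the other methods") can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: that the gap is measured in percentage points rather than as a relative increase, and that formal and semantic features are fused by simple concatenation. The function names (`f1_score`, `fuse`, `gap_over_others`) and all numbers are hypothetical and are not results from the paper.

```python
from statistics import mean


def f1_score(tp, fp, fn):
    """F1 from counts of true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fuse(formal, semantic):
    """Concatenate formal features with a semantic embedding.

    Concatenation is an assumed fusion strategy, not one stated in the abstract.
    """
    return list(formal) + list(semantic)


def gap_over_others(f1_by_model, target):
    """Percentage-point gap between the target model's mean F1 and the
    mean of the other models' mean F1 (assumed reading of 'higher')."""
    target_mean = mean(f1_by_model[target])
    others = [mean(v) for name, v in f1_by_model.items() if name != target]
    return (target_mean - mean(others)) * 100


# Hypothetical F1 scores per model (one value per experiment run).
scores = {
    "deep_forest": [0.70, 0.74],
    "random_forest": [0.68, 0.70],
    "svm": [0.66, 0.68],
}

if __name__ == "__main__":
    print(f"gap: {gap_over_others(scores, 'deep_forest'):.2f} percentage points")
```

If the reported figures are relative increases instead, the last line of `gap_over_others` would divide the difference by the mean of the other models before scaling; the abstract alone does not settle which reading applies.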