情报科学 (Information Science) ›› 2022, Vol. 40 ›› Issue (4): 90-95.

• Professional Research •

Identifying Potential "High-Quality" Articles in Science and Nature Based on Random Forest

  

  • Online:2022-04-01 Published:2022-05-15

Abstract: [Purpose/significance] This study aims to advance the identification of potential "high-quality" literature and its application to the identification, dissemination, and use of scientific and technical literature. [Method/process] Articles published in the leading international journals Science and Nature, together with their citation distribution data, are taken as the sample. For every article, features such as first-citation time, abstract length, total citation count, funding support, and paper length are computed to construct a feature matrix of "high-quality" articles. A model for identifying potential "high-quality" articles is then trained and applied using this feature matrix and the random forest algorithm. [Result/conclusion] The results show that combining the feature matrix of "high-quality" articles with the random forest model identifies potential "high-quality" articles in Science and Nature reasonably well. The model's mean classification accuracy exceeds 80%, and the accuracy for Nature is about 2% higher than that for Science. The model using the information gain method performs slightly better than the one using the Gini impurity method. In addition, "high-quality" articles in Science and Nature receive their first citation extremely quickly, within the year of publication. [Innovation/limitation] The combination of a "high-quality" literature feature matrix and a machine learning model can be applied well to the identification and recommendation of potential "high-quality" articles. In future work, however, the model needs to be extended to, and validated on, the identification of "high-quality" articles in large-scale literature collections.
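
The abstract compares two ways of scoring the splits inside the trees of a random forest: information gain and Gini impurity. The following minimal sketch shows how each criterion scores one candidate split on a single feature. The function names, the threshold, and the toy data are all assumptions made for illustration; none of them come from the paper, and this is not the authors' implementation.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(values, labels, threshold, impurity):
    """Impurity decrease obtained by splitting on `values <= threshold`.

    With `impurity=entropy` this is the information gain of the split;
    with `impurity=gini` it is the decrease in Gini impurity.
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted


# Toy data, invented for illustration only (NOT taken from the paper):
# years from publication to first citation, and a hypothetical label
# marking an article as "high-quality" (1) or not (0).
first_citation_years = [0, 0, 0, 1, 1, 2, 3, 4]
is_high_quality = [1, 1, 1, 1, 0, 0, 0, 0]

# Score the candidate split "first cited in the year of publication".
info_gain = split_gain(first_citation_years, is_high_quality, 0, entropy)
gini_gain = split_gain(first_citation_years, is_high_quality, 0, gini)
```

A tree learner evaluates many such candidate splits and keeps the one with the largest gain under the chosen criterion. The abstract does not say which software was used; in scikit-learn's `RandomForestClassifier`, for example, the two options correspond to `criterion="entropy"` and `criterion="gini"`.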