Information Science (情报科学) ›› 2021, Vol. 39 ›› Issue (11): 173-179.

• Doctoral Forum •

Research on Dataset Construction and Recognition Methods for Acknowledgement Functions in Academic Literature

  

  • Online: 2021-11-01 Published: 2021-11-15

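Abstract (translated from the Chinese): 【Purpose/Significance】This paper constructs a large-scale dataset of
acknowledgement functions in academic literature and proposes a SciBERT-based acknowledgement function
recognition model, providing high-quality data support and an effective recognition method for mining and
analyzing acknowledgement texts. 【Method/Process】The classification rules for acknowledgement functions are
manually extended and refined to generate rule templates for automatically labeling acknowledgement functions
in academic literature, and 1,750,275 acknowledgement texts are labeled by function. On this basis, the SciBERT
model is used to produce vector representations of acknowledgement sentences, a Softmax regression model is
introduced to classify acknowledgement functions automatically, a warmup strategy is used to tune the model,
and the results are compared with baseline experiments. 【Result/Conclusion】A large-scale, high-quality dataset
of acknowledgement functions in academic literature is obtained, with an accuracy of 93% under manual
verification. The SciBERT-based recognition model outperforms the baseline models, achieving an F1 score above
98% on the expanded dataset, with prediction results improving to varying degrees in every category.
【Innovation/Limitation】The acknowledgement function recognition model does not consider or incorporate
features unique to acknowledgement texts.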

Abstract: 【Purpose/significance】This paper constructs a large-scale acknowledgement function dataset from academic texts and proposes a SciBERT-based acknowledgement function recognition model, in order to provide high-quality data support and an effective recognition method for acknowledgement mining and analysis. 【Method/process】The classification rules for acknowledgement functions are extended and improved manually, an automatic labeling rule template for acknowledgement functions is then generated, and 1,750,275 acknowledgement sentences are labeled. On this basis, the SciBERT model is used to produce vector representations of acknowledgement texts, and a Softmax regression model is introduced to perform automatic text classification. In addition, a warmup strategy is adopted to optimize the model, and the proposed model is compared with benchmark methods. 【Result/conclusion】We obtained a large-scale, high-quality dataset of acknowledgement functions, and the precision of the labeling results reaches 93%. The proposed model outperforms the latest benchmark methods. On the expanded dataset, the F1 value of the proposed model is higher than 98%, and the prediction results in the individual categories also improve to varying degrees. 【Innovation/limitation】The proposed model does not consider or incorporate the unique features of acknowledgement texts.
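The classification step described in the abstract (a sentence vector passed to Softmax regression, trained with a warmup strategy) can be illustrated with a minimal sketch. This is not the authors' implementation: the sentence vector is a stand-in for a SciBERT representation (no encoder is loaded here), the number of classes and all weights are hypothetical, and the linear warmup followed by linear decay is an assumed schedule, since the abstract does not state which warmup variant or hyperparameters were used.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence_vec: Sequence[float],
             weights: Sequence[Sequence[float]],
             bias: Sequence[float]) -> List[float]:
    """Softmax regression: one linear layer over a sentence vector.

    sentence_vec stands in for a SciBERT sentence representation
    (hypothetical values; this sketch does not run an encoder).
    Each row of weights, with its bias, scores one function category.
    """
    logits = [sum(w * x for w, x in zip(row, sentence_vec)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def warmup_lr(step: int, base_lr: float,
              warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step.

    Assumed schedule: linear increase from 0 to base_lr over warmup_steps,
    then linear decay back to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

Calling `classify(vec, W, b)` returns one probability per category, and the predicted function is the index of the largest value. Starting with a small learning rate, as `warmup_lr` does for early steps, is a common way to stabilize fine-tuning of pretrained encoders before the full rate is applied.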