情报科学 (Information Science) ›› 2025, Vol. 43 ›› Issue (3): 146-156.

• Doctoral Forum •

Identifying Informal Citations of Scientific Data from the Perspective of Data Elements

  

  • Online: 2025-03-05 Published: 2025-05-27

Abstract: 【Purpose/significance】Scientific data, one of the forms in which research results are expressed, is often embedded in academic papers as informal citations. Automatically identifying data citation information in academic papers, and thereby extracting data elements, offers a new approach to organizing scientific data elements.【Method/process】To raise the proportion of positive examples and thus improve the identification of data citation sentences, corpora were built from the full text of bioinformatics papers using text structure recognition together with three sampling methods for imbalanced corpora: data augmentation, random undersampling, and feature word filtering. Each corpus was then combined with five text classification models to build a data citation identification pipeline.【Result/conclusion】Identifying data citation sentences in academic papers is an effective step toward finer-grained organization of data elements. Text structure recognition and imbalanced-corpus sampling effectively improve the identification of data citation sentences. Compared with traditional machine learning models, BERT-like deep learning models perform better at classifying data citation text.【Innovation/limitation】Identifying informal data citation sentences in academic papers brings a new perspective to the organization of data elements and is an efficient way to collect high-value data elements. However, because data citations in papers are non-standardized and sparse, classification precision still has room for improvement.
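Random undersampling, one of the sampling methods named in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_undersample`, the `ratio` parameter, and the toy sentences below are all hypothetical and chosen only to show the general idea of keeping every positive (data citation) sentence while discarding a random share of the negatives.

```python
import random
from collections import Counter


def random_undersample(sentences, labels, ratio=1.0, seed=0):
    """Keep all positives (label 1) and a random subset of negatives (label 0).

    `ratio` is a hypothetical parameter: the desired number of negative
    sentences per positive sentence after sampling.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than the corpus actually contains.
    n_keep = min(len(negatives), int(round(ratio * len(positives))))
    kept = sorted(positives + rng.sample(negatives, n_keep))  # keep original order
    return [sentences[i] for i in kept], [labels[i] for i in kept]


# Toy, made-up corpus: 2 data citation sentences among 10 sentences.
sentences = [
    "Sequences were downloaded from a public repository.",  # positive
    "The method is described in the next section.",
    "We thank the reviewers for their comments.",
    "Expression profiles were obtained from a public database.",  # positive
    "Figure 2 shows the overall workflow.",
    "Results are summarized in Table 1.",
    "The model converged after ten epochs.",
    "Related work is discussed below.",
    "All experiments were repeated three times.",
    "Future work will address these limitations.",
]
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

balanced_sentences, balanced_labels = random_undersample(sentences, labels, ratio=1.0)
label_counts = Counter(balanced_labels)
```

With `ratio=1.0` the sampled corpus holds as many negatives as positives, which raises the share of positive examples from 20% to 50% in this toy case. The trade-off is that discarded negatives carry information the classifier never sees, so the ratio would normally be tuned rather than fixed.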