Information Science, 2023, Vol. 41, Issue (9): 138-145.

• Professional Research •

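Research on Linkage Recognition between Scientific Data and Scientific Literature Based on the Labeled-LDA Model:
Taking the Biomedical Field as an Example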

  

  • Online: 2023-09-01  Published: 2023-10-07

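Abstract (translated from the Chinese): 【Purpose/significance】In the open-science era of ubiquitous interconnection, establishing links between scientific data and scientific literature has become an important measure for promoting open access to, sharing of, and reuse of scientific data. 【Method/process】Based on the Labeled-LDA model and supplemented by a rule-based recognition method, this study constructs a model for recognizing linkages between scientific data and scientific literature and, taking the biomedical field as an example, trains and tests the model for three linkage cases: standardized citation, non-standardized citation, and no citation. 【Result/conclusion】On the standardized-citation test set, the model's recognition rate and F value are about 0.9 and 0.5 respectively, a relatively stable recognition performance; on the non-standardized-citation and no-citation test sets, the recognition rates are 0.465 and 0.5 respectively, which also shows strong portability and application potential. Manual judgment of the recognition results for non-standardized citation and no citation finds that non-standard data citation does exist in scientific research, and the academic community needs to jointly promote the standardization of data citation. 【Innovation/limitation】Compared with other studies, the model constructed in this paper provides a methodological reference and basis for semantics-based linkage recognition and can be applied to large-scale corpus research, thereby promoting knowledge discovery of deeper semantic linkages.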

Abstract: 【Purpose/significance】In the interconnected era of Open Science, linking scientific data and scientific literature has become an important measure for promoting open access to, sharing of, and reuse of scientific data. 【Method/process】To identify and extract the hidden linkages between scientific data and scientific literature, this paper constructs a linkage recognition model based on the Labeled-LDA model, supplemented by a rule-based recognition method. Taking biomedical papers and scientific data as the research object, it trains and tests the model through text mining for three linkage cases: standardized citation, non-standardized citation, and no citation. 【Result/conclusion】On the standardized-citation test set, the recognition rate and F value of the model are about 0.9 and 0.5 respectively, indicating relatively stable recognition performance. On the non-standardized-citation and no-citation test sets, the recognition rates are 0.465 and 0.5 respectively, showing strong portability and application potential. Manual judgment of the recognition results for non-standardized citation and no citation shows that non-standard data citation does exist in scientific research, and that the academic community needs to jointly promote the standardization of data citation. 【Innovation/limitation】Compared with other studies, the model constructed in this paper provides a methodological reference and basis for semantics-based linkage recognition, and it can be applied to large-scale corpus research to promote knowledge discovery of deeper semantic linkages.
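
The abstract does not spell out the rules used in the rule-based step or the exact definition of the F value, so the following is only a minimal sketch under stated assumptions. It assumes a single hypothetical rule (matching GEO-style series accession numbers such as `GSE12345` in the text of a paper) and assumes the F value is the usual balanced F-measure, the harmonic mean of precision and recall. The names `ACCESSION_RE`, `find_accessions`, and `f_value`, and the sample sentence, are illustrative and do not come from the paper.

```python
import re

# Hypothetical rule (an assumption, not taken from the paper):
# a GEO series accession is "GSE" followed by one or more digits.
ACCESSION_RE = re.compile(r"\bGSE\d+\b")


def find_accessions(text: str) -> set[str]:
    """Return the set of GEO-style accession numbers mentioned in the text."""
    return set(ACCESSION_RE.findall(text))


def f_value(predicted: set[str], gold: set[str]) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        # No correct prediction: define the score as 0 to avoid dividing by zero.
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example sentence and gold-standard links, for illustration only.
paper = "Expression profiles were downloaded from GEO (GSE12345 and GSE678)."
found = find_accessions(paper)
score = f_value(found, {"GSE12345", "GSE999"})
```

In this toy case one of the two extracted identifiers matches one of the two gold-standard links, so precision and recall are both 0.5. A rule of this kind can only catch explicit identifiers; linkages with no explicit mention are the case the abstract assigns to the Labeled-LDA model.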