情报科学 (Information Science), 2025, Vol. 43, Issue 4: 62-71.

• Theoretical Research •

Research on Cross-Domain Literature Information Representation Based on Graph Contrastive Learning

  

  • Online: 2025-04-05 Published: 2025-08-28

Abstract: 【Purpose/significance】Content-based recommendation methods cannot fully capture the complex implicit relationships between papers, or between papers and authors. To address this, this paper combines the relations encoded in an academic knowledge graph with content-based features, so that each approach compensates for the other's weaknesses, and proposes a cross-domain literature information representation method based on graph contrastive learning (Graph Contrast Learning for Information Representation Method, GCLIRM). The method targets two problems: reliance on a single representation approach and insufficiently informative representations.【Method/process】The method models both a heterogeneous graph and a homogeneous graph. On the heterogeneous graph, a random walk with restart first generates node sequences and the Skip-gram algorithm initializes the node representations; a two-level attention mechanism then models node importance and meta-path importance to learn the heterogeneous-graph node representations. On the homogeneous graph, the pre-trained large model ERNIE 3.0 encodes the node features, a GAT aggregates neighbor information, and an autoencoder provides unsupervised training. Finally, a contrastive learning strategy learns the final node features, which serve as the literature information representation.【Result/conclusion】On two downstream tasks, subject classification and journal classification, GCLIRM improves the F1 score over the second-best method by 5.31% and 2.49%, respectively, indicating a substantial improvement in the ability to represent cross-domain literature information.【Innovation/limitation】This paper designs a hybrid method for extracting literature information representations that overcomes the insufficient representation of single-approach methods and shows a degree of feasibility and accuracy, offering a reference for subsequent related research.
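As a rough illustration of two steps named in the abstract, the sketch below shows a random walk with restart that produces node sequences, and an InfoNCE-style contrastive loss that pulls together two views of the same node. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not give the walk length, restart probability, loss form, or temperature, so the function names (`random_walk_with_restart`, `info_nce_loss`), the parameter values, the choice of InfoNCE, and the toy graph are all hypothetical.

```python
import math
import random


def random_walk_with_restart(adj, start, length, restart_prob, rng):
    """Return a sequence of `length` nodes beginning at `start`.

    At each step the walker jumps back to `start` with probability
    `restart_prob`; otherwise it moves to a uniformly chosen neighbor.
    """
    walk = [start]
    current = start
    while len(walk) < length:
        neighbors = adj.get(current, [])
        if not neighbors or rng.random() < restart_prob:
            current = start  # restart (also used to escape dead ends)
        else:
            current = rng.choice(neighbors)
        walk.append(current)
    return walk


def _cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce_loss(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss over nodes.

    Row i of `view_a` and row i of `view_b` form the positive pair;
    every other row of `view_b` acts as a negative for row i.
    The temperature value is an assumption, not taken from the paper.
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [_cosine(anchor, other) / temperature for other in view_b]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_denominator - logits[i]  # equals -log softmax(logits)[i]
    return total / len(view_a)


# Hypothetical toy heterogeneous graph: papers, one author, one venue.
toy_graph = {
    "paper1": ["author1", "venue1"],
    "paper2": ["author1"],
    "author1": ["paper1", "paper2"],
    "venue1": ["paper1"],
}
walk = random_walk_with_restart(toy_graph, "paper1", 8, 0.3, random.Random(0))

# Two toy "views" of the same two nodes: identical rows vs. swapped rows.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a pipeline of the kind the abstract describes, walks like `walk` would be fed to a Skip-gram model to initialize node embeddings, and the two views passed to the loss would be the heterogeneous-graph and homogeneous-graph embeddings of the same paper. The loss is lower when the two views of a node agree (`aligned`) than when they are mismatched (`swapped`).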