Information Science (情报科学), 2025, Vol. 43, Issue 7: 86-96.

• Theoretical Research •

Research on Knowledge Graph Construction for Multi-source Heterogeneous Medical and Health Data

  • Online: 2025-07-05 Published: 2025-10-16

Abstract:

【Purpose/significance】This paper explores knowledge extraction and knowledge fusion methods for multi-source heterogeneous
medical and health data and optimizes the automated knowledge graph construction process, aiming to improve the ability to
integrate such data and the efficiency and quality of users' information retrieval. 【Method/process】First, an ontology is designed
on the basis of UMLS medical terms and concepts, assisted by a topic analysis of the unstructured data with the BTM topic model.
Then BERT-Base, BioBERT, and MC-BERT are compared as the encoder of the CasRel model, and the best-performing
MC-BERT-CasRel model is selected to extract entity-relation triples from the unstructured data; the semi-structured data are
reorganized to establish relations between entities. Next, the SapBERT model and the Levenshtein edit distance algorithm are
used to fuse the triples, which completes the construction of the knowledge graph. 【Result/conclusion】With the proposed method,
a "digestive system diseases" knowledge graph containing 10,010 entities and 29,044 relations is constructed, and a knowledge
retrieval application is implemented. 【Innovation/limitation】This paper focuses on the integration of multi-source heterogeneous
medical and health data and offers new ideas and methods for knowledge graph construction in the medical and health vertical
domain in the Internet environment. However, the sample size of the data sources is limited; future work could perform knowledge
extraction and knowledge fusion on larger datasets.
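
The abstract names Levenshtein edit distance as one of the two signals used to fuse triples, but gives no details. The following is a minimal sketch of how string-level entity fusion with edit distance might look. The function names, the greedy grouping strategy, the 0.8 threshold, and the sample entity names are all assumptions made for illustration and are not taken from the paper; the SapBERT embedding step is not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuse_entities(names, threshold=0.8):
    """Greedy grouping (hypothetical strategy): each name joins the first
    group whose first member is at least `threshold` similar to it."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


if __name__ == "__main__":
    # Illustrative entity mentions, not data from the paper.
    print(fuse_entities(["chronic gastritis", "chronic gastritus", "peptic ulcer"]))
```

Edit distance only catches surface variants such as misspellings; synonyms with different spellings would need a semantic signal, which is presumably why the paper pairs it with SapBERT.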