Information Science (情报科学) ›› 2025, Vol. 43 ›› Issue (9): 133-138.

• Business Research •

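Research on a Multi-Dimensional Semantic Cross-Modal Retrieval Algorithm for Scientific Literature in the Big Data Environment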

  

  • Publication date: 2025-09-05 Release date: 2025-12-12

  • Online:2025-09-05 Published:2025-12-12

Abstract: 【Purpose/significance】In the big data environment, scientific literature spans many fields, topics, and modalities, such as text, images, and audio, so effectively filtering out redundant multimodal data is a current difficulty.【Method/process】To address this, the study proposes a multi-dimensional semantic cross-modal retrieval algorithm for scientific literature in the big data environment. A biased Kalman filter removes redundant data from the scientific literature database. On this basis, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm extracts the feature words that represent each text. The semantic similarity between feature words is computed from the values of the document retrieval elements, and a retrieval matrix is generated by combining it with the relevance of the retrieval elements, completing the multi-dimensional semantic cross-modal retrieval of scientific literature.【Result/conclusion】Experimental results show that the proposed algorithm achieves high retrieval accuracy, high NDCG values, and shorter retrieval time.【Innovation/limitation】The algorithm addresses the limitations of traditional keyword retrieval by fusing multimodal data, exploiting rich semantic information, and bridging the semantic gap, which improves the effectiveness and accuracy of scientific literature retrieval and provides researchers and scholars with more convenient and comprehensive retrieval services.
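
As a reading aid, two of the building blocks named in the abstract, TF-IDF feature-word extraction and the NDCG evaluation metric, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `tfidf_feature_words` and `ndcg`, the toy documents, the unsmoothed `log(N / df)` weighting, and the linear-gain form of NDCG are all choices made here for illustration, and the abstract does not specify which variants the paper uses. The biased Kalman filtering step and the retrieval matrix are not covered.

```python
import math
from collections import Counter


def tfidf_feature_words(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms of each tokenized document.

    Assumption: plain term frequency times unsmoothed log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    features = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        features.append(ranked[:top_k])
    return features


def ndcg(relevances, k=None):
    """NDCG of a ranked list of graded relevance labels.

    Assumption: linear gain with a log2 position discount.
    """
    k = len(relevances) if k is None else k

    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["kalman", "filter", "noise", "filter"],
    ["semantic", "retrieval", "matrix"],
    ["semantic", "filter", "image"],
]
features = tfidf_feature_words(docs, top_k=2)
```

Terms that occur in many documents (here `filter` and `semantic`) receive a low inverse document frequency and therefore rank below terms specific to one document. An NDCG of 1.0 means the ranked list is already in ideal relevance order, which is why higher values indicate better ranking quality.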