情报科学 (Information Science) ›› 2025, Vol. 43 ›› Issue (1): 127-136.

• Practice Research •

A Scholar Research Interest Generation Model Integrating BERTopic and Prompt: A Case Study of the Computer Science Field

  

  • Online: 2025-01-05 Published: 2025-06-27

Abstract: 【Purpose/significance】A scholar's research interests are a key feature of the scholar's profile. By tracing how those interests change over time, this study helps fill gaps in academic résumés, which matters both for building complete scholar profiles and for precise talent discovery aimed at frontier needs.【Method/process】A text corpus of computer science papers was constructed and used to train a BERTopic topic model, which mines the field's research topics and identifies the features of scholars' research interests. A Prompt was then created so that an LLM extracts topic words, and these were combined with the topic model's results to describe scholars' research interests.【Result/conclusion】On the task of describing scholars' research interests, the fusion model improved on the baseline model by an average relative 8.2% in ROUGE score and a relative 4.5% in BERTScore. An Analytic Hierarchy Process (AHP) evaluation found that the fusion model combining BERTopic and an LLM identified scholars' research interests better than the other evaluated models, and its manual evaluation satisfaction rate reached 81.4%.【Innovation/limitation】The proposed model identifies scholars' research topics better and generates high-quality descriptions of their research interests. Because the corpus consists largely of Chinese-language texts, the model performs poorly at recognizing foreign-language publications.
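The "relative increase" figures reported for ROUGE can be read as (fusion score − baseline score) / baseline score. The following is a minimal sketch of that arithmetic, assuming a simplified ROUGE-1 F1 computed on whitespace tokens; the function names `rouge_1_f1` and `relative_gain` and the toy strings are hypothetical illustrations, not the authors' evaluation code or data. Chinese text would first need a word segmenter, which is omitted here.

```python
from collections import Counter


def rouge_1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def relative_gain(baseline: float, fused: float) -> float:
    """Relative improvement of `fused` over `baseline`, in percent."""
    return (fused - baseline) / baseline * 100.0


# Toy strings invented for illustration only; they are not data from the paper.
reference = "topic modeling and large language models for scholar profiling"
baseline_out = "language models for text classification"
fused_out = "topic modeling and language models for scholar profiling"

base_score = rouge_1_f1(reference, baseline_out)
fused_score = rouge_1_f1(reference, fused_out)
print(f"baseline ROUGE-1 F1: {base_score:.3f}")
print(f"fused    ROUGE-1 F1: {fused_score:.3f}")
print(f"relative gain: {relative_gain(base_score, fused_score):.1f}%")
```

A relative gain is sensitive to the size of the baseline score, so it is usually read alongside the absolute scores rather than on its own.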