Information Science (情报科学), 2024, Vol. 42, Issue 9: 51-60.

• Theoretical Research •

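Topic Mining and Evolution Analysis of Scientific and Technical Reports Based on the BERTopic Model: A Case Study of the Biotechnology Field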

  

  • Online:2024-09-01 Published:2024-11-06

Abstract: 【Purpose/significance】Scientific and technical report data are a fundamental national strategic resource, and research into technologies and methods for developing and using them is imperative. Identifying the research topics in the biotechnology field and how they evolve helps fill a gap in application scenarios for the development and use of scientific and technical report data.【Method/process】A text corpus of scientific and technical reports in the biotechnology field was constructed, and a BERTopic topic model was trained on it to mine the field's research topics and analyze their evolution.【Results/conclusion】The BERTopic model identified 30 topics in the biotechnology field. Hierarchical clustering of these topics revealed nine major research directions: plant genomics and genetic modification, genetic engineering and industrial biotechnology, applications of biotechnology in biological and ecological environments, veterinary virology and immunology, molecular genetics and biochemistry, cardiovascular metabolic health and neurobiology, bone biology and regenerative medicine, and biomedical and clinical research.【Innovation/limitation】The constructed model identifies the research topics present in scientific and technical report data more effectively, and the topic description texts it generates for the biotechnology field are of good quality. The corpus supports semantic analysis of only the abstract and time fields of the report data; other fields were not analyzed.
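
The abstract does not describe how topic descriptions are derived from the trained model. By default, BERTopic represents each topic with class-based TF-IDF (c-TF-IDF): all documents assigned to a topic are merged into one bag of words, and each term is weighted by its within-topic frequency times log(1 + A / f), where A is the average number of words per topic and f is the term's frequency across all topics. The block below is a minimal plain-Python sketch of that weighting only, not the authors' pipeline; the function names and the toy tokenized documents are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Score terms per topic with class-based TF-IDF.

    docs_by_topic maps a topic id to a list of tokenized documents
    (each document is a list of strings).
    """
    # Merge every topic's documents into a single bag of words.
    topic_counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_by_topic.items()
    }

    # Frequency of each term across all topics.
    term_totals = Counter()
    for counts in topic_counts.values():
        term_totals.update(counts)

    # Average number of words per topic.
    avg_words = sum(term_totals.values()) / len(topic_counts)

    scores = {}
    for topic, counts in topic_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (count / n_words) * math.log(1 + avg_words / term_totals[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, k=3):
    """Return the k highest-scoring terms per topic (ties broken alphabetically)."""
    return {
        topic: [term for term, _ in
                sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for topic, term_scores in scores.items()
    }


# Hypothetical toy corpus: two topics, two tokenized abstracts each.
toy_corpus = {
    0: [["gene", "editing", "rice"], ["rice", "genome", "gene"]],
    1: [["bone", "regeneration", "cell"], ["bone", "cell", "gene"]],
}
toy_scores = c_tf_idf(toy_corpus)
print(top_terms(toy_scores, k=2))
```

In this toy corpus, "gene" appears in both topics, so it is weighted below "rice" in topic 0 even though the two terms are equally frequent there. This is the property that makes c-TF-IDF terms usable as topic descriptions.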