情报科学 (Information Science) ›› 2024, Vol. 42 ›› Issue (7): 137-145.

• Practice Research •

Chinese Academic Text Title Generation Based on Open-Source LLMs: A Case Study of the Humanities and Social Sciences

  

  • Publication date: 2024-07-01  Release date: 2024-11-05

  • Online:2024-07-01 Published:2024-11-05

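Abstract (translated from Chinese): 【Purpose/significance】As a compressed representation of a paper and the essence of its main idea, the title plays an important role in retrieval, indexing, and related tasks. Taking title generation for academic texts in the humanities and social sciences as an example, this study offers a reference for applying large language models to academic text mining.【Method/process】From an empirical standpoint, the study examines how effective the current open-source Chinese large language model Qwen-7B is at academic text title generation, and whether it is feasible to inject knowledge from humanities and social science academic texts into an open-source base large language model. ROUGE and BLEU are used to score lexical-level recall and precision, and the ChatGPT dialogue system is used to score sentence fluency and semantic relevance.【Result/conclusion】Injecting knowledge from Chinese humanities and social science academic texts into the Qwen-7B base model does not effectively improve its title generation ability, and the ability of the open-source base model Qwen-7B to learn features and semantics in Chinese needs further strengthening; the LLaMA2-7B model outperforms Qwen-7B at generating titles for Chinese academic texts.【Innovation/limitation】Using the Qwen-7B model and full-text academic data from the humanities and social sciences, the study demonstrates the applicability and application paths of current mainstream open-source large language models in China for academic text title generation, providing a theoretical and practical reference for mining and organizing academic full texts. Because of resource constraints, the comparison models and training methods used here are limited in variety and should be extended to explore more fully the boundaries of large language models in mining and organizing knowledge from academic full texts.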

Abstract: 【Purpose/significance】As a compressed representation of a paper and the essence of its main idea, the title plays an important role in retrieval and indexing. Taking the task of academic text title generation in the humanities and social sciences as an example, this study provides a reference for the application of large language models in academic text mining.【Method/process】From an empirical perspective, we explore the effectiveness of the current open-source Chinese large language model Qwen-7B in the task of academic text title generation, and the feasibility of injecting knowledge from academic text data in the humanities and social sciences into an open-source base large language model. Lexical-level recall and precision are scored using the ROUGE and BLEU metrics, while sentence fluency and semantic relevance are scored using the ChatGPT dialogue system.【Result/conclusion】Injecting academic text knowledge from the Chinese humanities and social sciences into the Qwen-7B base model does not effectively improve the model's ability in the title generation task, and the feature and semantic learning ability of the open-source base model Qwen-7B on Chinese needs to be further enhanced; the LLaMA2-7B model outperforms the Qwen-7B model in generating titles for Chinese academic texts.【Innovation/limitation】Based on the Qwen-7B model and academic full-text data in the humanities and social sciences, the study demonstrates the applicability and application paths of current mainstream open-source large language models in China for academic text title generation, which provides theoretical and practical references for academic full-text mining and organization. The comparison models and training methods used in this paper are limited in variety due to resource constraints, and need to be further extended to fully explore the boundaries of large language models in academic full-text knowledge mining and organization.
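
The lexical scoring mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation script: it assumes character-level tokenization for Chinese titles (an assumption, not stated in the source) and computes only unigram scores, namely ROUGE-1 recall and a BLEU-style clipped unigram precision. Full BLEU also combines higher-order n-grams and a brevity penalty, which are omitted here. All function names are hypothetical.

```python
from collections import Counter


def char_unigrams(text: str) -> Counter:
    """Count characters, ignoring whitespace.

    Character-level splitting is an assumed stand-in for Chinese
    tokenization; a word segmenter could be substituted.
    """
    return Counter(ch for ch in text if not ch.isspace())


def clipped_overlap(reference: str, candidate: str) -> int:
    """Number of candidate unigrams matched in the reference,
    with each unigram's count clipped to its reference count."""
    ref, cand = char_unigrams(reference), char_unigrams(candidate)
    return sum((ref & cand).values())


def rouge1_recall(reference: str, candidate: str) -> float:
    """Share of reference unigrams recovered by the candidate."""
    total = sum(char_unigrams(reference).values())
    return clipped_overlap(reference, candidate) / total if total else 0.0


def unigram_precision(reference: str, candidate: str) -> float:
    """Share of candidate unigrams that appear in the reference
    (the unigram component of BLEU, without brevity penalty)."""
    total = sum(char_unigrams(candidate).values())
    return clipped_overlap(reference, candidate) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical reference and generated titles, for illustration only.
    reference_title = "基于开源LLMs的中文学术文本标题生成研究"
    generated_title = "开源大语言模型的中文学术标题生成"
    print("ROUGE-1 recall:", rouge1_recall(reference_title, generated_title))
    print("Unigram precision:", unigram_precision(reference_title, generated_title))
```

Recall rewards a generated title for covering the reference title's vocabulary, while precision penalizes extra material not in the reference; reporting both, as the abstract describes, guards against titles that score well on only one of the two.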