情报科学 (Information Science), 2024, Vol. 42, Issue (11): 76-83.

• Theoretical Research •

BERT-Based Domain Word Segmentation Optimization for Analyzing Borrowing Hotspots in University Libraries

  

  • Online: 2024-11-01 Published: 2025-04-08

Abstract: 【Purpose/significance】Changes in library borrowing data reflect shifts in what borrowers were focusing on in a given year and, to some extent, the research hotspots of society as a whole. This paper uses a large language model to build a relationship model between the fields of university library borrowing and reservation data and social hotspots, explores the relationship between borrowing data and social hotspots, and supports the analysis of social hotspots over a period of time. 【Method/process】First, a word segmentation model for book titles is built with an encoder-decoder structure and trained on large word segmentation datasets to obtain the original word frequencies. Next, domain matching is performed based on the reader's department and call number fields. Finally, the original word frequencies are reweighted from three perspectives (number of loans, reservation duration, and domain) to obtain the final hot word cloud related to social hotspots. 【Result/conclusion】The segmentation model was evaluated first. The experiments show that the proposed algorithm achieves clearly higher F-values than the other algorithms on the MSR, PKU, and CTB6 datasets; on the CTB6 segmentation dataset its F-value reaches 97.18, which is 3.15 percentage points higher than the CRF algorithm. With domain optimization added, the segmentation algorithm performs better on highly specialized text. The library borrowing and reservation data were then analyzed experimentally, demonstrating the advantages of the hot word cloud generation framework based on domain segmentation optimization. The experiments show that the hot words generated by the proposed algorithm can be linked to social hotspots to some extent. 【Innovation/limitation】This paper studies the field characteristics of book borrowing and reservation data and proposes a BERT-based domain word segmentation optimization framework for generating borrowing hotspots. Although the framework exploits the characteristics of the library's data fields to build the hot word cloud and improve its results, there is not yet a quantitative metric for the performance of hot word cloud generation, which requires further exploration and research.
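The reweighting step described in the abstract (updating title word frequencies by number of loans, reservation duration, and domain) can be illustrated with a minimal Python sketch. The abstract does not state the actual weighting formula, so everything specific here is an assumption made for illustration: the `LoanRecord` fields, the additive log-scaled weight, and the coefficients `w_borrow`, `w_reserve`, and `w_domain` are hypothetical, and the titles are assumed to have been segmented already by some upstream model.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class LoanRecord:
    """One book's aggregated record (hypothetical schema, not from the paper)."""
    title_tokens: List[str]   # title already split into words by a segmenter
    borrow_count: int         # number of times the book was borrowed
    reservation_days: float   # how long readers waited on reservation
    domain_match: bool        # reader department agrees with call-number domain


def weighted_term_frequencies(
    records: Iterable[LoanRecord],
    w_borrow: float = 1.0,
    w_reserve: float = 1.0,
    w_domain: float = 1.0,
) -> Counter:
    """Accumulate a weight per title word instead of a raw count.

    Assumed weight per record:
        1 + w_borrow * log(1 + loans) + w_reserve * log(1 + days) + domain bonus
    A record with no loans, no reservation, and no domain match therefore
    contributes exactly 1 per word, i.e. the raw word frequency.
    """
    scores: Counter = Counter()
    for rec in records:
        weight = (
            1.0
            + w_borrow * math.log1p(rec.borrow_count)
            + w_reserve * math.log1p(rec.reservation_days)
            + (w_domain if rec.domain_match else 0.0)
        )
        for token in rec.title_tokens:
            scores[token] += weight
    return scores


# Toy data, invented for the example only.
records = [
    LoanRecord(["deep", "learning"], borrow_count=10, reservation_days=5, domain_match=True),
    LoanRecord(["learning", "history"], borrow_count=0, reservation_days=0, domain_match=False),
]
scores = weighted_term_frequencies(records)
top_words = [word for word, _ in scores.most_common()]
```

The resulting `scores` mapping could be passed to any word cloud renderer that accepts word-to-weight frequencies. Log scaling is used here only so that a few heavily borrowed titles do not dominate the cloud; a different scheme may fit real data better.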