Information Science (情报科学) ›› 2022, Vol. 40 ›› Issue (10): 164-170.

• Doctoral Forum •

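Research on a Sentence Segmentation Model for Ancient Chinese Texts Based on Weighted Multi-Strategy Sample Selection: A Case Study of the History of Song (《宋史》)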

  

  • Published: 2022-10-01  Online: 2022-10-01

Abstract: [Purpose/Significance] This paper studies how to build a sentence segmentation model for ancient Chinese texts from a small number of annotated samples, so as to reduce the annotation cost incurred during model training and to offer a new perspective on integrating digital technology with the humanities. [Method/Process] Starting from the uncertainty and diversity of ancient text samples, the paper proposes a weighted multi-strategy sample selection method and combines it with sentence segmentation models such as BERT-BiLSTM-CRF and BERT-CRF. Using the concepts of information entropy and similarity, it analyzes the uncertainty and diversity of ancient texts and applies a weighted calculation to estimate how valuable each sample is for model training. The samples selected as valuable are annotated manually and added to the training set, and the model is then retrained iteratively. [Result/Conclusion] Taking the ancient book History of Song (《宋史》) as an example, the proposed method reduces the required training sample size by 50% and 55% for the BERT-BiLSTM-CRF and BERT-CRF models, respectively, which supports the effectiveness of the method. [Innovation/Limitation] The weighted multi-strategy sample selection method offers a new approach to training sentence segmentation models for ancient texts. Future work will explore its applicability to other tasks in the collation of ancient books.
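
The selection step described in the abstract (an entropy-based uncertainty score, a similarity-based diversity score, and a weighted combination of the two) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the entropy normalization, the use of cosine similarity, the single weight `alpha`, and all function names are hypothetical and are not the authors' implementation. In practice the per-character label probabilities would come from a segmentation model such as BERT-CRF, and the embeddings from its encoder.

```python
import math
from typing import List, Sequence


def normalized_entropy(token_probs: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy of per-character label distributions, scaled to [0, 1]."""
    if not token_probs:
        return 0.0
    total = 0.0
    for dist in token_probs:
        h = -sum(p * math.log(p) for p in dist if p > 0.0)
        # Divide by the maximum possible entropy for this number of labels.
        total += h / math.log(len(dist)) if len(dist) > 1 else 0.0
    return total / len(token_probs)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def diversity(embedding: Sequence[float],
              labeled_embeddings: Sequence[Sequence[float]]) -> float:
    """1 minus the highest similarity to any labeled sample (1.0 if none exist)."""
    if not labeled_embeddings:
        return 1.0
    return 1.0 - max(cosine(embedding, e) for e in labeled_embeddings)


def select_samples(pool_probs: Sequence[Sequence[Sequence[float]]],
                   pool_embeddings: Sequence[Sequence[float]],
                   labeled_embeddings: Sequence[Sequence[float]],
                   k: int,
                   alpha: float = 0.5) -> List[int]:
    """Return indices of the k unlabeled samples with the highest weighted score.

    score = alpha * uncertainty + (1 - alpha) * diversity  (assumed weighting)
    """
    scores = []
    for probs, emb in zip(pool_probs, pool_embeddings):
        u = normalized_entropy(probs)
        d = diversity(emb, labeled_embeddings)
        scores.append(alpha * u + (1.0 - alpha) * d)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy pool of three sentences, two characters each, with two labels
# (e.g. "break after this character" vs. "no break").
pool_probs = [
    [[0.5, 0.5], [0.5, 0.5]],      # model is maximally uncertain
    [[0.99, 0.01], [0.98, 0.02]],  # model is confident
    [[0.6, 0.4], [0.9, 0.1]],      # moderately uncertain
]
pool_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
labeled_embeddings = [[1.0, 0.0]]  # one sample is already annotated

print(select_samples(pool_probs, pool_embeddings, labeled_embeddings, k=2))
```

In an iterative loop of the kind the abstract describes, the selected samples would be annotated by hand, moved from the pool into the training set, and the model retrained before the next selection round. Setting `alpha` to 1.0 reduces the sketch to pure uncertainty sampling, and setting it to 0.0 to pure diversity sampling.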