情报科学 (Information Science), 2025, Vol. 43, Issue (3): 91-98.

• 业务研究 (Practice Research) •

Readability Assessment Methods for Chinese Children's Reading Materials Based on Machine Learning

  

  • Online: 2025-03-05 Published: 2025-05-27

Abstract: 【Purpose/significance】Text readability is an important indicator of the difficulty of reading materials. Developing readability assessment methods for Chinese text and applying them to children's reading materials makes it possible to select suitably difficult texts for readers at different reading levels, and thereby to improve reading ability effectively.【Method/process】Using primary school Chinese language textbooks as test samples, this paper builds ten text readability classifiers and compares their performance. The classifiers are based on four traditional machine learning algorithms (linear regression, support vector machine, decision tree classifier, K-nearest neighbor), five ensemble learning algorithms (random forest classifier, extremely randomized trees, AdaBoost, Bagging, and XGBoost), and one artificial neural network, the multilayer perceptron.【Result/conclusion】The ensemble-based random forest, Bagging, and XGBoost classifiers achieve higher classification accuracy than the other classifiers, with maximum cross-validation accuracy and F1 values both above 0.75. In particular, the random forest classifier shows excellent performance in predicting the readability of primary school Chinese textbooks, with maximum cross-validation accuracy and F1 values both above 0.76.【Innovation/limitation】This study provides an effective tool for assessing the difficulty of Chinese children's books and for screening materials. Future work will collect more Chinese text data and incorporate more advanced deep learning algorithms to further improve the accuracy and range of application of Chinese text readability classifiers.
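
The abstract reports accuracy and F1 as its evaluation metrics. As a minimal sketch of how these two values can be computed for a multi-class readability classifier, the Python function below takes true and predicted grade-level labels and returns both. Several things here are assumptions rather than details from the paper: the function name `accuracy_and_macro_f1` is hypothetical, the F1 is macro-averaged (the abstract does not say which averaging was used), and the label lists are invented for illustration.

```python
from collections import Counter


def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equal-length label lists.

    Assumption: macro averaging; the abstract does not specify the F1 variant.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    # Accuracy: fraction of passages whose predicted label matches the true one.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)

    # Per-class counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the passage was not p
            fn[t] += 1  # true class t was missed

    # Per-class F1 = 2*TP / (2*TP + FP + FN), then an unweighted mean.
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)
    return accuracy, macro_f1


# Hypothetical grade-level labels (1-3) for six passages; not data from the paper.
true_grades = [1, 1, 2, 2, 3, 3]
predicted_grades = [1, 2, 2, 2, 3, 1]
acc, f1 = accuracy_and_macro_f1(true_grades, predicted_grades)
```

In a cross-validation setting, a function like this would be applied to the held-out fold of each split, and the per-fold values would then be summarized, for example by their maximum as the abstract reports.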