情报科学 (Information Science) ›› 2021, Vol. 39 ›› Issue (12): 165-173.

• Doctoral Forum •

Constructing a Detection Model for Online Medical Misinformation: A Pre-trained BERT-Based Model

  

  • Online:2021-12-01 Published:2021-12-29

Abstract: 【Purpose/significance】This study addresses the lack of professional knowledge that hampers the collection of online medical misinformation datasets, and supports the construction of online medical misinformation detection models in small-sample domains. 【Method/process】We propose constructing an online medical misinformation dataset by transforming and extracting content from authoritative rumor-refuting information, and we then build traditional machine learning models, a CNN model, and a BERT model in turn for classification. 【Result/conclusion】The results show that a medical misinformation dataset can be built from refutation information at low cost and without relying on expert labeling. Comparative experiments on a dataset of online medical misinformation related to COVID-19 show that the BERT model pre-trained on Weibo data achieves an accuracy of 95.91% and an F1 score of 94.57%, improvements of nearly 6% and 4% over the traditional machine learning models and the CNN model, respectively. This indicates that the pre-trained BERT-based model constructed in this paper performs better on the online medical misinformation detection task. 【Innovation/limitation】The proposed method can build misinformation datasets for professional domains at low cost, and the resulting BERT medical misinformation detection model also has practical value in small-sample domains. However, the study still needs to be extended in terms of dataset size, comparison with other deep learning models, and model performance evaluation metrics.
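The abstract reports accuracy and F1 as the evaluation metrics. As a reminder of how these two figures are derived from a classifier's predictions, the following is a minimal sketch in plain Python. It assumes a binary task in which label 1 marks misinformation and F1 is computed on that positive class; the abstract does not state which F1 variant was used, and the function name `accuracy_and_f1` and the toy labels are hypothetical, not the paper's data.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels.

    Assumption: `positive` marks the misinformation class and F1 is the
    positive-class F1, which may differ from the variant used in the paper.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1


# Toy labels for illustration only (not taken from the paper).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}, F1={f1:.4f}")
```

Reporting both values matters on small or imbalanced datasets, where accuracy alone can look high even if the misinformation class is poorly detected.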