Information Science (情报科学), 2021, Vol. 39, Issue 11: 96-102.

• Professional Research •

Fine-Grained Named Entity Recognition Based on Deep Learning: A Case Study of Tomato Diseases and Pests

  

  • Online: 2021-11-01 Published: 2021-11-15

Abstract: 【Purpose/significance】Domain-oriented fine-grained named entity recognition is important for improving the precision of text mining. Taking named entities for tomato diseases and pests as an example, this paper explores how deep learning can be used to realize domain-oriented fine-grained named entity recognition.【Method/process】E-books, papers, and web pages are used as data sources, and six entity types are annotated: variety, disease or pest, symptom, time, plant part, and control agent. Character vectors pre-trained with BERT and with CBOW are each fed into a BiLSTM-CRF model for training, and supplementary rules are applied after recognition to control entity boundaries.【Result/conclusion】Combining BERT pre-trained character vectors with BiLSTM-CRF gives an F-score of 81.03% after the supplementary rules are applied, which is better than the other models tested and indicates good performance for entity recognition in the tomato disease and pest domain.【Innovation/limitation】BERT pre-trained character vectors effectively reduce the impact of word segmentation errors on entities in the tomato disease and pest domain, and supplementary rules tailored to the characteristics of each entity type effectively control entity boundaries and improve recognition accuracy. However, the rules are applied only at the test stage and are not incorporated into training, so overall accuracy still has room for improvement.
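The post-recognition boundary control and the F-score mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not list the actual rules, so everything specific here is an assumption made for illustration: the tag names `AGENT` and `DISEASE`, the suffix lexicon `AGENT_SUFFIXES`, the single "extend to a trailing formulation word" rule, and the example sentence. It is not the authors' implementation, and exact-match entity-level F1 is assumed as the scoring convention.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)

# Hypothetical lexicon: formulation words that often close a pesticide name.
AGENT_SUFFIXES = ("可湿性粉剂", "乳油", "悬浮剂")


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, type) spans from a character-level BIO sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                spans.add((start, i, etype))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans


def extend_agent_boundaries(text: str, spans: Set[Span]) -> Set[Span]:
    """Illustrative rule: stretch an AGENT span over a directly following suffix."""
    fixed: Set[Span] = set()
    for start, end, etype in spans:
        if etype == "AGENT":
            for suffix in AGENT_SUFFIXES:
                if text.startswith(suffix, end):
                    end += len(suffix)
                    break
        fixed.add((start, end, etype))
    return fixed


def f_score(gold: Set[Span], pred: Set[Span]) -> float:
    """Entity-level F1: a prediction counts only if span and type match exactly."""
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the tagger truncates the pesticide name after "百菌清".
text = "喷施百菌清可湿性粉剂防治早疫病"
gold = {(2, 10, "AGENT"), (12, 15, "DISEASE")}
pred_tags = (["O", "O", "B-AGENT", "I-AGENT", "I-AGENT"] + ["O"] * 7
             + ["B-DISEASE", "I-DISEASE", "I-DISEASE"])

pred = bio_to_spans(pred_tags)
print(f_score(gold, pred))                                 # 0.5
print(f_score(gold, extend_agent_boundaries(text, pred)))  # 1.0
```

Because the rule runs only on model output, it leaves the trained BiLSTM-CRF unchanged, which matches the limitation the abstract notes: the rules act at test time and do not feed back into training.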