情报科学 (Information Science), 2023, Vol. 41, Issue (10): 12-20.

• Special Article •

Research on the Construction of a Network Sensitive Dictionary Based on Sensitive Semantics and Compound Co-occurrence

  

  • Online: 2023-10-01  Published: 2023-12-04

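Abstract (translated from the Chinese original):

【Purpose/significance】Existing sensitive dictionaries are small in scale and incomplete in their classification of sensitive words. To address these problems, this paper proposes a strategy for constructing a sensitive dictionary based on sensitive semantics and compound co-occurrence. 【Method/process】First, an initial word set is built from sensitive words collected from multiple social platforms and is divided into a basic sensitive dictionary and a candidate sensitive word set through content analysis and category annotation. Second, sensitive prior probability, sensitive semantic correlation and compound co-occurrence are combined to obtain a sensitive semantic extension word set. Finally, a defined comprehensive sensitivity is computed for the candidate sensitive words and the sensitive semantic extension word set, the candidate sensitive words are filtered accordingly, and the construction of the extended sensitive dictionary is completed. 【Result/conclusion】Compared with existing sensitive dictionaries, the extended sensitive dictionary built in this paper improves the accuracy, recall and F1 value of sensitive information recognition by up to 17%, 24% and 22%, respectively. 【Innovation/limitation】Starting from an important basic resource for sensitive information recognition, the paper builds a basic sensitive dictionary and extends it with effective extension words selected by comprehensive sensitivity. A limitation is that the indicators influencing word sensitivity have not been explored sufficiently.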
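
The filtering stage described above can be illustrated with a small sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the function names, the equal weights, the 0.5 threshold, cosine similarity over toy vectors as a stand-in for sensitive semantic correlation, and a plain document-level co-occurrence ratio as a stand-in for compound co-occurrence. It is not the paper's actual comprehensive sensitivity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prior_probability(word, docs):
    """Share of documents containing `word` that are labelled sensitive."""
    labels = [is_sensitive for tokens, is_sensitive in docs if word in tokens]
    return sum(labels) / len(labels) if labels else 0.0

def semantic_relatedness(word, base_words, vectors):
    """Highest non-negative cosine similarity between `word` and any base word."""
    if word not in vectors:
        return 0.0
    sims = [cosine(vectors[word], vectors[b]) for b in base_words if b in vectors]
    return max(0.0, max(sims, default=0.0))

def cooccurrence_ratio(word, base_words, docs):
    """Share of documents containing `word` that also contain a base word."""
    containing = [tokens for tokens, _ in docs if word in tokens]
    if not containing:
        return 0.0
    hits = sum(1 for tokens in containing if base_words & set(tokens))
    return hits / len(containing)

def comprehensive_sensitivity(word, base_words, docs, vectors,
                              weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical combined score: weighted sum of the three signals."""
    w_prior, w_sem, w_cooc = weights
    return (w_prior * prior_probability(word, docs)
            + w_sem * semantic_relatedness(word, base_words, vectors)
            + w_cooc * cooccurrence_ratio(word, base_words, docs))

def extend_dictionary(base_words, candidates, docs, vectors, threshold=0.5):
    """Keep candidates whose score reaches the (assumed) threshold."""
    kept = {w for w in candidates
            if comprehensive_sensitivity(w, base_words, docs, vectors) >= threshold}
    return set(base_words) | kept

# Toy data, invented for this example: (tokens, is_sensitive) pairs.
DOCS = [
    (["scam", "phishing", "link"], True),
    (["phishing", "email", "scam"], True),
    (["weather", "sunny", "today"], False),
    (["weather", "flight", "delay"], False),
    (["phishing", "password"], True),
]
BASE = {"scam"}
CANDIDATES = {"phishing", "weather"}
VECTORS = {"scam": [1.0, 0.0], "phishing": [0.9, 0.1], "weather": [0.0, 1.0]}

extended = extend_dictionary(BASE, CANDIDATES, DOCS, VECTORS)
print(sorted(extended))  # ['phishing', 'scam']
```

On this toy data, "phishing" scores highly on all three signals and is kept, while "weather" scores zero on each and is dropped. In practice the weights and threshold would need to be tuned against annotated data.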

Abstract:

【Purpose/significance】To address the small scale and incomplete classification of sensitive words in existing sensitive dictionaries, a more complete and reasonable sensitive dictionary is constructed based on sensitive semantics and compound co-occurrence. 【Method/process】First, the initial word set, built from sensitive words collected from multiple social platforms, is divided into a basic sensitive dictionary and a candidate sensitive word set through content analysis and category annotation. Second, a sensitive semantic extension word set is obtained by integrating sensitive prior probability, sensitive semantic correlation and compound co-occurrence, which extends the sensitive semantic information of the candidate sensitive words. Finally, using the defined comprehensive sensitivity calculation, the candidate sensitive words are filtered by measuring the comprehensive sensitivity of the candidate sensitive words and the sensitive semantic extension word set, which completes the construction of the extended sensitive dictionary. 【Result/conclusion】Compared with existing sensitive dictionaries, the extended sensitive dictionary proposed in this paper improves the accuracy, recall and F1 value of sensitive information recognition by up to 17%, 24% and 22%, respectively, and effectively improves the overall performance of sensitive information recognition. 【Innovation/limitation】This paper starts from an important basic resource for sensitive information recognition and constructs a basic sensitive dictionary. Effective extension words are selected by measuring comprehensive sensitivity, and the basic sensitive dictionary is extended with them. A limitation is that the indicators influencing word sensitivity have not been mined sufficiently.
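
To make the reported comparison concrete, the following sketch shows how recall and F1 might be computed for dictionary-based recognition with a basic versus an extended dictionary. It assumes the first reported metric is precision, as is common alongside recall and F1; the matching rule (a text is flagged if any token is in the dictionary), the function names and the toy samples are all hypothetical and are not taken from the paper.

```python
def flag_text(tokens, dictionary):
    """Flag a text as sensitive if any of its tokens is in the dictionary."""
    return any(token in dictionary for token in tokens)

def precision_recall_f1(predictions, labels):
    """Standard precision, recall and F1 for binary predictions."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def evaluate(dictionary, samples):
    """Score a dictionary on (tokens, is_sensitive) samples."""
    predictions = [flag_text(tokens, dictionary) for tokens, _ in samples]
    labels = [label for _, label in samples]
    return precision_recall_f1(predictions, labels)

# Toy samples, invented for this example.
SAMPLES = [
    (["scam", "link"], True),
    (["phishing", "password"], True),
    (["weather", "sunny"], False),
    (["scam", "movie", "review"], False),
]

base_scores = evaluate({"scam"}, SAMPLES)
extended_scores = evaluate({"scam", "phishing"}, SAMPLES)
print("basic dictionary:   ", base_scores)
print("extended dictionary:", extended_scores)
```

Here the extended dictionary catches the second sample that the basic one misses, so recall and F1 rise while the false positive on the fourth sample remains. This mirrors the kind of gain the abstract reports, though the numbers themselves are purely illustrative.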