情报科学 ›› 2021, Vol. 39 ›› Issue (11): 103-109.

• 业务研究 •

面向新时代的人民日报语料中文分词歧义分析

  • 出版日期:2021-11-01 发布日期:2021-11-15

  • Online:2021-11-01 Published:2021-11-15

摘要: 【目的/意义】对近几年的人民日报语料中文分词结果进行统计和分析有利于总结新时代的中文语料在分词歧义方面的规律,提高分词效果,促进中文信息处理的相关研究和技术的发展。【方法/过程】本文以2015年以后的共4个月新时代的人民日报分词语料为研究对象,通过统计词频、词长、从合度等信息,从名词、动词、数词、量词、副词、形容词、区别词、方位词、处所词、时间词、代词、介词、连词、助词、习用语、否定词、前后缀等类型来讨论变异词的切分规律。【结果/结论】结果发现新时代的人民日报语料中的切分变异大部分为假歧义,相同语法结构的二字词要比三字词、四字词的切分变异从合度更高。【创新/局限】本文首次面向新时代的人民日报语料讨论了中文分词歧义的问题,但缺少与旧语料的对比分析。

Abstract: 【Purpose/significance】Statistical analysis of Chinese word segmentation results in recent People's Daily corpora helps to summarize the patterns of segmentation ambiguity in new-era Chinese text, improve segmentation performance, and promote research and technology development in Chinese information processing.【Method/process】This paper takes four months of segmented New Era People's Daily (NEPD) corpus data from 2015 onward as its research object. Using statistics on word frequency, word length, and congruence (从合度), it discusses the segmentation patterns of variant words across categories including nouns, verbs, numerals, quantifiers, adverbs, adjectives, distinguishing words, localizers, place words, time words, pronouns, prepositions, conjunctions, auxiliary words, idioms, negative words, and prefixes/suffixes.【Result/conclusion】The results show that most segmentation variations in the NEPD corpus are false ambiguities, and that, among words with the same grammatical structure, two-character words have higher congruence in their segmentation variations than three-character and four-character words.【Innovation/limitation】This paper is the first to discuss Chinese word segmentation ambiguity in the New Era People's Daily corpus, but it lacks a comparative analysis with older corpora.
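The statistics named in the method (word frequency, word length, and congruence) can be illustrated with a minimal sketch. The abstract does not define congruence (从合度), so the code below assumes it is the share of a string's occurrences in which the string is kept as a single token, out of all occurrences where it appears either whole or split across adjacent tokens. The function name `segmentation_variants`, its parameters, and the toy sentences are hypothetical and not taken from the paper.

```python
from collections import Counter


def segmentation_variants(sentences, min_len=2, max_len=4):
    """Find strings segmented inconsistently in a tokenized corpus.

    sentences: list of token lists (one list per segmented sentence).
    Returns {string: (whole_count, split_count, ratio)} for every string
    that occurs both as one token and as a run of adjacent tokens.
    ratio = whole / (whole + split) is an ASSUMED reading of congruence.
    """
    whole = Counter()  # times the string appears as a single token
    split = Counter()  # times it appears as two or more adjacent tokens
    for tokens in sentences:
        for tok in tokens:
            if min_len <= len(tok) <= max_len:
                whole[tok] += 1
        for i in range(len(tokens)):
            joined = tokens[i]
            for j in range(i + 1, len(tokens)):
                joined += tokens[j]
                if len(joined) > max_len:
                    break
                if len(joined) >= min_len:
                    split[joined] += 1
    variants = {}
    for s in whole.keys() & split.keys():
        w, p = whole[s], split[s]
        variants[s] = (w, p, w / (w + p))
    return variants


# Toy corpus (invented for illustration only).
sentences = [
    ["很", "多", "人"],
    ["很多", "人"],
    ["很多", "问题"],
    ["不", "是"],
    ["不是", "这样"],
]
result = segmentation_variants(sentences)
for word, (w, p, ratio) in sorted(result.items()):
    print(word, len(word), w, p, round(ratio, 3))
```

Grouping the resulting ratios by `len(word)` would allow a comparison of two-, three-, and four-character strings of the kind the abstract reports. The sketch only detects inconsistent segmentation; deciding whether a variation is a true or a false ambiguity requires inspecting the context, which it does not attempt.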