Information Science (情报科学), 2021, Vol. 39, Issue 7: 91-98.

• Professional Research •

Author Name Disambiguation Based on Doc2vec and SVM: A Case Study of PubMed Central

  

  • Online: 2021-07-16 Published: 2021-07-16

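Abstract (Chinese version): 【Purpose/significance】To address the problem of identifying authors who share the same name and to improve the accuracy of author name disambiguation.【Method/process】Building on features such as author affiliation and email, this paper focuses on capturing the continuity and evolution of an author's research direction and research content. It proposes constructing an author corpus that combines article titles, keywords, abstracts, and references with auxiliary author information such as co-author lists, emails, and institutions. Doc2vec is used to learn deep text representations, and on top of the learned features a support vector machine (SVM) is trained on manually annotated samples. Taking PubMed Central (PMC) as an example, the model is applied to the full PMC dataset once it has achieved good results on a local subset.【Result/conclusion】The proposed method reaches an accuracy of 91.80%, effectively improving disambiguation accuracy. It not only uses traditional features such as author institution, email, and co-author lists, but also traces authors through the continuity and evolution of their research content, integrating several kinds of features to handle cases where disambiguation based solely on affiliation or email fails. This gives it stronger application prospects as scholar mobility increases.【Innovation/limitation】This study packages each author as a separate document containing all of the author's attributes and related information, and completes the disambiguation task by combining unsupervised text representation learning with supervised machine learning. The approach is well suited to data from the life sciences and medicine.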

Abstract: 【Purpose/significance】This paper aims to solve the problem of distinguishing authors who share the same name and to improve the accuracy of author name disambiguation.【Method/process】This paper proposes using Doc2vec to learn deep representations of text features drawn from titles, keywords, abstracts, references, co-authors, emails, and affiliation information, and then training and testing nine SVM models on the Doc2vec features. Taking PubMed Central data as an example, the model is applied to disambiguate all authors once it achieves good accuracy in testing.【Result/conclusion】The results show that this method effectively improves the accuracy of author recognition, reaching 91.80%. The method not only makes full use of affiliation, email, and co-author features, but also follows the succession and evolution of research content, integrating multiple features to solve the author name disambiguation problem, especially for records without affiliation or email. This gives it greater potential as authors move between institutions more frequently.【Innovation/limitation】The combination of unsupervised text representation learning and supervised machine learning performs well on author data from the biomedical and life sciences.
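
The abstract describes representing each author as a single document that combines title, keywords, abstract, references, co-authors, email, and affiliation before Doc2vec is applied. The following is a minimal sketch of that packaging step only, not the authors' implementation. The `AuthorRecord` class, its field names, the `coauthor:` and `email:` token prefixes, and the tokenization rule are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class AuthorRecord:
    """One author mention on one paper (hypothetical schema, not from the paper)."""
    name: str
    title: str = ""
    keywords: List[str] = field(default_factory=list)
    abstract: str = ""
    references: List[str] = field(default_factory=list)
    coauthors: List[str] = field(default_factory=list)
    email: str = ""
    affiliation: str = ""


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def author_document(rec: AuthorRecord) -> List[str]:
    """Flatten one record into a token list suitable for a document-embedding model.

    Assumption: the author's own name is left out, since every candidate in a
    same-name group shares it and it carries no distinguishing signal.
    """
    tokens: List[str] = []
    # Free-text fields are tokenized word by word.
    for text in (rec.title, rec.abstract, rec.affiliation,
                 *rec.keywords, *rec.references):
        tokens.extend(tokenize(text))
    # Co-author names and the email are kept as single atomic tokens
    # (an illustrative choice) so that exact matches are preserved.
    tokens.extend("coauthor:" + "_".join(tokenize(c)) for c in rec.coauthors)
    if rec.email:
        tokens.append("email:" + rec.email.lower())
    return tokens
```

A token list produced this way could then be wrapped in gensim's `TaggedDocument` and passed to `Doc2Vec`, with the resulting vectors used as SVM inputs. How the paper actually builds its SVM training samples is not stated in the abstract, so that step is left out here.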