Information Science (情报科学) ›› 2021, Vol. 39 ›› Issue (3): 3-10.

• Special Articles •

Automatic Indexing of Questions in Q&A Communities Based on BERT and TF-IDF:
Taking the CNGOLD Q&A Community as an Example

  

  • Online: 2021-03-01 Published: 2021-03-15

Abstract:

【Purpose/significance】Automatic indexing of questions in a Q&A community can effectively support a website's
information organization and information services. Most existing research on automatic indexing focuses on
extraction indexing, which is not well suited to questions in Q&A communities.【Method/process】Taking the CNGOLD
Q&A community as an example, this paper combines assignment indexing and extraction indexing and proposes an
automatic indexing model for Q&A community questions based on the pre-trained language model BERT and TF-IDF.
The model uses a BERT-based multi-label classification algorithm to perform assignment indexing on questions,
divides the questions into short and long questions, and applies the TF-IDF algorithm to the long questions to
perform extraction indexing, supplementing their indexing tags.【Result/conclusion】The experimental results show
that the proposed model can effectively index Q&A community questions automatically, which matters for improving
the effectiveness of users' information retrieval.【Innovation/limitation】Using the internal and external
features of questions, this paper constructs an automatic indexing model for Q&A community questions based on
BERT and TF-IDF, and proposes a BERT-based multi-label classification algorithm.
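
The two-stage idea in the abstract (assigned tags for every question, plus extracted keywords for long questions only) can be illustrated with a minimal sketch. This is not the authors' implementation: the dictionary of label logits merely stands in for the output of a BERT-based multi-label classifier, and the length cutoff, decision threshold, smoothed IDF formula, and all function names are hypothetical choices made for illustration. Questions are assumed to be tokenized already; Chinese text would first need a word segmenter, which the abstract does not specify.

```python
import math
from collections import Counter

# Hypothetical settings; the abstract gives neither value.
LONG_QUESTION_MIN_TOKENS = 8   # questions with at least this many tokens count as "long"
LABEL_THRESHOLD = 0.5          # per-label probability needed to assign a tag


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def assign_tags(label_logits, threshold=LABEL_THRESHOLD):
    """Assignment indexing: keep every label whose sigmoid score passes the threshold.

    `label_logits` maps label -> raw score and stands in for classifier output.
    Each label is decided independently, which is what makes this multi-label.
    """
    return [label for label, logit in label_logits.items()
            if sigmoid(logit) >= threshold]


def tfidf_keywords(tokens, corpus, top_k=2):
    """Extraction indexing: rank one question's tokens by TF-IDF against a corpus."""
    n_docs = len(corpus)
    # Document frequency: number of corpus questions containing each token.
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    term_freq = Counter(tokens)
    scores = {
        # Smoothed IDF (an assumption) so unseen tokens do not divide by zero.
        tok: (count / len(tokens)) * math.log((1 + n_docs) / (1 + doc_freq[tok]))
        for tok, count in term_freq.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
    return ranked[:top_k]


def index_question(tokens, label_logits, corpus):
    """Assigned tags for every question; long questions also get extracted keywords."""
    tags = assign_tags(label_logits)
    if len(tokens) >= LONG_QUESTION_MIN_TOKENS:
        for keyword in tfidf_keywords(tokens, corpus):
            if keyword not in tags:
                tags.append(keyword)
    return tags
```

A short question therefore receives only the classifier's tags, while a long question may gain keywords that no predefined label covers, which is the gap the extraction step is meant to fill.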