Information Science (情报科学), 2021, Vol. 39, Issue 10: 178-184.

• Doctoral Forum •

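Model Construction and Experimental Analysis of a Multi-Level Text Classification Method: The Case of Import and Export Commodity Classification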

  

  • Online:2021-10-01 Published:2021-11-01

Abstract: 【Purpose/Significance】Government data has grown at an unprecedented rate in the digital era, which poses challenges for the automated processing of multi-category government data. Against this background, this paper applies a multi-level text classification method to the automatic classification of import and export commodities. 【Method/Process】Based on the hierarchical structure of HS codes, a three-level classification model is constructed that classifies text layer by layer and then combines the layer results; the classification performance of SVM and TextRNN, among other algorithms, is also compared. 【Result/Conclusion】The multi-level classification model performs well overall on the commodity classification problem. With sufficient data, TextRNN slightly outperforms SVM (layer 1: 93.00% vs. 92.90%; layer 2: 96.46% vs. 96.38%), whereas with insufficient training data SVM has a clear advantage (layer 3: 92.49% for TextRNN vs. 95.92% for SVM). SVM achieves the best cumulative accuracy, 85.88%. 【Innovation/Limitation】This study attempts to solve automatic commodity classification with a multi-level classification method, but the data scale and application scenarios remain to be expanded.
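
The abstract does not spell out how the cumulative (superposition) accuracy is computed. As a minimal sketch, assume a sample counts as correct only when all three layers are correct and that each layer's reported accuracy is conditional on the previous layers being right; the cumulative accuracy is then the product of the per-layer accuracies. Under that assumption the SVM figures reproduce the reported 85.88%. The TextRNN result below is derived from this assumption and is not a number reported in the abstract; the names `LAYER_ACCURACY` and `cumulative_accuracy` are illustrative only.

```python
from math import prod

# Per-layer accuracies (layers 1-3) as reported in the abstract.
LAYER_ACCURACY = {
    "SVM": [0.9290, 0.9638, 0.9592],
    "TextRNN": [0.9300, 0.9646, 0.9249],
}


def cumulative_accuracy(layer_accuracies):
    """Product of per-layer accuracies.

    Assumption (not stated in the abstract): a sample is correct only if
    every layer is correct, and each layer's accuracy is conditional on
    the preceding layers having been classified correctly.
    """
    return prod(layer_accuracies)


for name, accuracies in LAYER_ACCURACY.items():
    print(f"{name}: {cumulative_accuracy(accuracies):.2%}")
# SVM: 85.88%
# TextRNN: 82.97%
```

Read this way, the layer-3 gap dominates the comparison: TextRNN's small leads in layers 1 and 2 (0.10 and 0.08 percentage points) are outweighed by its 3.43-point deficit in layer 3.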