Information Science (情报科学), 2023, Vol. 41, Issue (4): 51-61.

• Theoretical Research •

Parsing Structural Framework Elements of Multilingual Scientific Paper PDF Documents Based on Rule-Faster-RCNN

  

  • Online: 2023-05-01 Published: 2023-05-19

Abstract: 【Purpose/significance】PDF documents faithfully preserve the content and appearance of the original document, but this makes them difficult to parse. To mine the text of multilingual scientific papers more comprehensively and automatically, this paper explores methods for effectively parsing PDF documents of multilingual papers and extracting knowledge from them.【Method/process】This paper proposes a Rule-Faster-RCNN-based method for parsing the structural framework elements of multilingual scientific papers in PDF format. The structural framework elements of a paper's full text are divided into text elements and figure-table elements, which are extracted separately using rules supplemented by the Faster-RCNN deep learning method. The rule method uses the layout characteristics of papers to identify text framework elements and figure-table elements; the deep learning method treats figure-table recognition as object detection and builds a Faster-RCNN network to compensate for the shortcomings of the rule method.【Result/conclusion】Experiments show that the proposed PDF parsing method outperforms the baseline method and successfully obtains valid full-text knowledge from scientific papers.【Innovation/limitation】Using rules supplemented by deep learning, this paper extracts the full-text structural framework elements of multilingual scientific papers at a finer granularity and verifies the effectiveness of the method; however, owing to the complexity of PDF documents, table elements are extracted only as images, and the text inside tables is not extracted.
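
The abstract does not say how the rule-based results and the Faster-RCNN detections are combined. As a minimal illustrative sketch, assuming both stages return axis-aligned bounding boxes on the same page coordinates, one plausible way to let detections "supplement" the rules is to keep every rule-based box and add only those detector boxes that do not substantially overlap a box already kept. The names `iou` and `supplement` and the 0.5 overlap threshold are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# A box is (x0, y0, x1, y1) in page coordinates; this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def supplement(rule_boxes: List[Box],
               detector_boxes: List[Box],
               threshold: float = 0.5) -> List[Box]:
    """Keep all rule-based boxes; add detector boxes that overlap none of
    the boxes kept so far by `threshold` IoU or more (hypothetical policy)."""
    merged = list(rule_boxes)
    for d in detector_boxes:
        if all(iou(d, kept) < threshold for kept in merged):
            merged.append(d)
    return merged


if __name__ == "__main__":
    rules = [(0.0, 0.0, 10.0, 10.0)]
    detections = [(1.0, 1.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
    print(supplement(rules, detections))
```

In this sketch the first detection overlaps the rule-based box heavily and is dropped, while the second covers a region the rules missed and is added. The paper's actual fusion strategy may differ.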