Information Science (情报科学) ›› 2024, Vol. 42 ›› Issue (3): 80-88.

• Theoretical Research •



  • Online: 2024-03-05 Published: 2024-06-08





【Purpose/significance】 Corpora are a very important data source for translation in cross-language information retrieval (CLIR). In CLIR, evaluating corpus performance, translating and extracting bilingual dictionaries, and performing semantic disambiguation help meet people's needs for acquiring knowledge and information.

【Method/process】 This paper collects Chinese and English web pages from news websites such as the Wall Street Journal, the Financial Times and the Hong Kong Government, uses the open-source software HTML Parser to filter out non-text content, converts the format, and finally generates XML files. From these it builds its own parallel corpus, applies the CL-LSI and TDS models to it, and evaluates the resulting performance.

【Result/conclusion】 In building the CLIR evaluation corpus, the study verifies that the TDS model can fully and objectively extract semantically associated bilingual topic features during bilingual paired search, and that, through bilingual paired search, its CLIR performance exceeds the retrieval efficiency of the CL-LSI model.

【Innovation/limitation】 Building on in-depth research on corpora, this paper proposes a cross-language information retrieval model (TDS) based on dual space in parallel corpora, and collects Chinese and English corpora separately for each given topic. The keywords obtained are applied to the TDS model, and the co-occurrence semantic information of bilingual terms is analyzed. In this way, the goals of parallel corpus construction and performance evaluation are achieved. A limitation is that translation accuracy is low when the number of bilingual topics is small and improves as the number of topics increases.
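
The preprocessing step described under Method/process (filtering non-text content out of collected pages and emitting XML for a parallel corpus) might look roughly like the following Python sketch. The paper names the Java library HTML Parser; Python's standard-library `html.parser` is swapped in here as a stand-in. The names `TextExtractor`, `html_to_text`, and `build_pair_xml`, and the `<doc>`/`<text lang="...">` layout, are assumptions made for illustration and are not the authors' actual format.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collect visible text and skip non-text elements (hypothetical helper)."""

    SKIP = {"script", "style", "noscript"}  # assumed set of non-text tags

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only text that is outside every skipped element.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML page as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_pair_xml(doc_id, en_html, zh_html):
    """Serialize one English/Chinese page pair as an XML record (assumed layout)."""
    root = ET.Element("doc", {"id": doc_id})
    ET.SubElement(root, "text", {"lang": "en"}).text = html_to_text(en_html)
    ET.SubElement(root, "text", {"lang": "zh"}).text = html_to_text(zh_html)
    return ET.tostring(root, encoding="unicode")
```

Calling `build_pair_xml` once per matched English/Chinese page pair would yield one aligned record per document. How pages are matched across languages is not stated in the abstract, so the sketch leaves it out.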