BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
https://papers.cool/arxiv/2104.08663
Authors: Nandan Thakur; Nils Reimers; Andreas Rücklé; Abhishek Srivastava; Iryna Gurevych
Summary: Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to enable researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems, including lexical, sparse, dense, late-interaction, and re-ranking architectures, on the BEIR benchmark. Our results show that BM25 is a robust baseline and that re-ranking and late-interaction models on average achieve the best zero-shot performance, albeit at high computational cost. In contrast, dense and sparse retrieval models are computationally more efficient but often underperform the other approaches, highlighting considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards more robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.
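To make the zero-shot setup concrete, here is a minimal sketch of how a single BEIR dataset could be evaluated with the released toolkit. The dataset choice (SciFact), the download URL, and the msmarco-distilbert-base-tas-b checkpoint are illustrative assumptions that follow the repository's documented usage, not results reported in the paper:

```python
import os

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and unzip one BEIR dataset (SciFact used here as an example).
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, os.path.join(os.getcwd(), "datasets"))

# Load the corpus, queries, and relevance judgments for the test split.
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Zero-shot evaluation: a dense retriever trained on MS MARCO is applied
# to SciFact without any in-domain fine-tuning.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # includes nDCG@10, the primary metric reported in the paper
```

Swapping in other datasets or retrieval models follows the same pattern, which is what allows the benchmark to compare heterogeneous architectures under a common zero-shot protocol.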
A: This paper introduces a heterogeneous benchmark called BEIR (Benchmarking IR) that targets the limitations in evaluating the out-of-distribution (OOD) generalization of existing neural information retrieval (IR) models. Specifically, existing neural IR models have mostly been studied in homogeneous and narrow settings, which considerably limits insight into how well they generalize to new tasks and domains, and researchers have lacked a standardized way to broadly compare retrieval models in a zero-shot setting.
Through the BEIR benchmark, the paper aims to provide a framework for better evaluating and understanding existing retrieval systems and for accelerating progress towards more robust and generalizable systems in the future.
A: The related research discussed in the paper mainly focuses on the following areas: