Language Models As Semantic Indexers
https://papers.cool/arxiv/2310.07815
Authors: Bowen Jin ; Hansi Zeng ; Guoyin Wang ; Xiusi Chen ; Tianxin Wei ; Ruirui Li ; Zhengyang Wang ; Zheng Li ; Yang Li ; Hanqing Lu ; Suhang Wang ; Jiawei Han ; Xianfeng Tang
Summary: Semantic identifier (ID) is an important concept in information retrieval that aims to preserve the semantics of objects such as documents and items inside their IDs. Previous studies typically adopt a two-stage pipeline to learn semantic IDs by first procuring embeddings using off-the-shelf text encoders and then deriving IDs based on the embeddings. However, each step introduces potential information loss and there is usually an inherent mismatch between the distribution of embeddings within the latent space produced by text encoders and the anticipated distribution required for semantic indexing. Nevertheless, it is non-trivial to design a method that can learn the document's semantic representations and its hierarchical structure simultaneously, given that semantic IDs are discrete and sequentially structured, and the semantic supervision is deficient. In this paper, we introduce LMINDEXER, a self-supervised framework to learn semantic IDs with a generative language model. We tackle the challenge of sequential discrete ID by introducing a semantic indexer capable of generating neural sequential discrete representations with progressive training and contrastive learning. In response to the semantic supervision deficiency, we propose to train the model with a self-supervised document reconstruction objective. The learned semantic indexer can facilitate various downstream tasks, such as recommendation and retrieval. We conduct experiments on three tasks including recommendation, product search, and document retrieval on five datasets from various domains, where LMINDEXER outperforms competitive baselines significantly and consistently.
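A: This paper proposes LMINDEXER, a self-supervised framework for learning semantic identifiers (semantic IDs) in information retrieval (IR). A semantic ID is a sequence of discrete IDs that captures the semantics of a document's content, which makes it useful for document understanding and retrieval. Existing work typically learns semantic IDs with a two-stage pipeline, which can lose information at each stage. In addition, the distribution of embeddings produced by the text encoder does not match the distribution that semantic indexing requires. Learning and assigning semantic IDs also lacks sufficient supervision signals, which makes it challenging to design a method that learns a document's semantic representation and its hierarchical structure at the same time.
The LMINDEXER framework addresses these problems through the following key points:
Overall, the paper aims to develop a more effective method for learning semantic IDs, in order to improve how documents are understood and indexed in information retrieval systems.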
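To make the two-stage baseline that the paper argues against more concrete, here is a minimal sketch of how semantic IDs can be derived from precomputed embeddings. It is not LMINDEXER and not taken from the paper: the hierarchical k-means step, the toy 2-D vectors standing in for text-encoder output, and every name (`kmeans`, `assign_semantic_ids`, `toy_embeddings`) are assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors: List[Vector], k: int, iters: int = 20) -> List[int]:
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centers),
        )
        centers.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: squared_distance(v, centers[j]))
            for v in vectors
        ]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean(members)
    return labels


def assign_semantic_ids(
    embeddings: Dict[str, Vector], k: int = 2, depth: int = 2
) -> Dict[str, Tuple[int, ...]]:
    """Stage 2 of the assumed baseline: recursive clustering of fixed embeddings.

    Each level of the cluster tree contributes one discrete token to the ID.
    """
    ids: Dict[str, Tuple[int, ...]] = {doc: () for doc in embeddings}

    def split(docs: List[str], level: int) -> None:
        if level == depth or len(docs) <= 1:
            return
        n_clusters = min(k, len(docs))
        labels = kmeans([embeddings[d] for d in docs], n_clusters)
        for j in range(n_clusters):
            group = [d for d, lab in zip(docs, labels) if lab == j]
            for d in group:
                ids[d] = ids[d] + (j,)
            split(group, level + 1)

    split(list(embeddings), 0)
    return ids


# Stage 1 stand-in: hand-written vectors instead of real text-encoder output.
toy_embeddings = {
    "running shoes": (0.9, 0.1),
    "trail sneakers": (0.8, 0.2),
    "espresso machine": (0.1, 0.9),
    "coffee grinder": (0.2, 0.8),
}
semantic_ids = assign_semantic_ids(toy_embeddings, k=2, depth=2)
for doc, sid in semantic_ids.items():
    print(doc, sid)
```

In this toy run, the two footwear items share their first ID token and the two coffee items share a different one, which is the hierarchical property semantic IDs are meant to have. Because the embeddings are fixed before clustering, nothing in the second stage can correct what the encoder got wrong, which is the kind of mismatch the summary above describes.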
A: The related work discussed in the paper falls mainly into the following areas: