Language Models As Semantic Indexers

Authors: Bowen Jin ; Hansi Zeng ; Guoyin Wang ; Xiusi Chen ; Tianxin Wei ; Ruirui Li ; Zhengyang Wang ; Zheng Li ; Yang Li ; Hanqing Lu ; Suhang Wang ; Jiawei Han ; Xianfeng Tang

Summary: Semantic identifier (ID) is an important concept in information retrieval that aims to preserve the semantics of objects such as documents and items inside their IDs. Previous studies typically adopt a two-stage pipeline to learn semantic IDs by first procuring embeddings using off-the-shelf text encoders and then deriving IDs based on the embeddings. However, each step introduces potential information loss and there is usually an inherent mismatch between the distribution of embeddings within the latent space produced by text encoders and the anticipated distribution required for semantic indexing. Nevertheless, it is non-trivial to design a method that can learn the document's semantic representations and its hierarchical structure simultaneously, given that semantic IDs are discrete and sequentially structured, and the semantic supervision is deficient. In this paper, we introduce LMINDEXER, a self-supervised framework to learn semantic IDs with a generative language model. We tackle the challenge of sequential discrete ID by introducing a semantic indexer capable of generating neural sequential discrete representations with progressive training and contrastive learning. In response to the semantic supervision deficiency, we propose to train the model with a self-supervised document reconstruction objective. The learned semantic indexer can facilitate various downstream tasks, such as recommendation and retrieval. We conduct experiments on three tasks including recommendation, product search, and document retrieval on five datasets from various domains, where LMINDEXER outperforms competitive baselines significantly and consistently.

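Q: What problem does this paper try to solve?

A: The paper proposes LMINDEXER, a self-supervised framework for learning semantic identifiers (semantic IDs) in information retrieval (IR). A semantic ID is a sequence of discrete IDs that captures the semantics of a document's content, which makes it valuable for document understanding and retrieval. Existing work usually learns semantic IDs in two stages, which can lose information at each stage, and the embedding distribution produced by the text encoder does not match the distribution that semantic indexing requires. In addition, learning and assigning semantic IDs lacks a sufficient supervision signal, so it is challenging to design a method that learns a document's semantic representation and its hierarchical structure at the same time.

LMINDEXER addresses these problems through the following key points:

Self-supervised learning: semantic IDs are learned directly from the documents in a self-supervised way, rather than by relying on an off-the-shelf pretrained text encoder followed by a separate ID-derivation step.
Generative language model: a generative language model is used to capture both the semantic representation and the hierarchical structure of a document, which helps in understanding and indexing its content.
The sequential discrete ID challenge: the framework introduces a semantic indexer that generates neural sequential discrete representations, and handles the difficulty of sequential discrete IDs with progressive training and contrastive learning.
Insufficient semantic supervision: to cope with the lack of semantic supervision, the paper trains the model with a self-supervised document reconstruction objective, so that the semantic indexer learns semantic IDs from which the original document can be reconstructed accurately.
Adaptability to downstream tasks: the learned semantic indexer can be used to generate semantic IDs and can also be fine-tuned for different downstream tasks, such as recommendation and retrieval, to improve task performance.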
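
The third point above, generating an ID one position at a time, can be sketched roughly as follows. This is only a toy illustration in Python, not the paper's model: the names `generate_semantic_id` and `toy_step`, the hand-written codebooks, and the subtraction-based state update are all assumptions made for the example. A real indexer would replace `toy_step` with a language-model decoder step and would learn the codebooks during training.

```python
import math
from typing import Callable, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate_semantic_id(
    doc_state: Vector,
    codebooks: List[List[Vector]],
    step: Callable[[Vector, Vector], Vector],
) -> List[int]:
    """Greedily pick one code per position; each pick changes the state used next."""
    state = list(doc_state)
    semantic_id: List[int] = []
    for codebook in codebooks:
        # Score every code at this position against the current state.
        scores = [sum(s * c for s, c in zip(state, code)) for code in codebook]
        probs = softmax(scores)
        choice = max(range(len(probs)), key=probs.__getitem__)
        semantic_id.append(choice)
        # Hypothetical stand-in for a decoder step conditioned on the chosen code.
        state = step(state, codebook[choice])
    return semantic_id


def toy_step(state: Vector, code: Vector) -> Vector:
    # Assumption for this sketch: drop the part already explained by the code.
    return [s - c for s, c in zip(state, code)]


# Hypothetical codebooks: one list of code vectors per ID position.
TOY_CODEBOOKS: List[List[Vector]] = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]],
]

print(generate_semantic_id([2.0, 0.5], TOY_CODEBOOKS, toy_step))
print(generate_semantic_id([0.3, 1.5], TOY_CODEBOOKS, toy_step))
```

With these toy values the first document maps to `[0, 0]` and the second to `[1, 2]`. The point of the sketch is only that the ID is discrete and sequential: the code chosen at one position affects which code is chosen at the next.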

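Overall, the paper aims to develop a more effective way to learn semantic IDs, in order to improve how information retrieval systems understand and index documents.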

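Q: What related work is there?

A: The related work discussed in the paper falls into the following areas:

Self-supervised learning and language models: BERT (Devlin et al., 2019) introduced masked language modeling and next-sentence prediction as self-supervised training objectives. RoBERTa (Liu et al., 2019) emphasized the importance of masked language modeling. ELECTRA (Clark et al., 2020) proposed discriminative language modeling, in which a generator replaces some tokens in a sentence with plausible alternatives and a discriminator distinguishes the original tokens from the replaced ones. Other work has proposed a range of self-supervised objectives, such as autoregressive causal language modeling (Brown et al., 2020), permutation language modeling (Yang et al., 2019), correcting and contrasting (Meng et al., 2021), and text-to-text transfer modeling (Raffel et al., 2020). These studies focus mainly on using self-supervised learning to train language models for natural language understanding and generation.
Semantic indexers: semantic indexers were first introduced in computer vision (Van Den Oord et al., 2017; Lee et al., 2022; Esser et al., 2021), where they convert an input image into a set of IDs that capture the essence of the original image. In information retrieval tasks such as document retrieval (Tay et al., 2022) and recommendation (Rajput et al., 2023), these IDs are used to represent documents and are applied to generative recommendation (Hua et al., 2023) and retrieval (Sun et al., 2023). However, the construction of these IDs depends heavily on prior knowledge or supervision from the downstream task.
Self-supervised semantic indexing methods: current self-supervised semantic indexing methods usually follow a two-step process. First, an off-the-shelf text encoder such as BERT encodes the input document and produces an embedding. Then a technique such as rq-VAE (Rajput et al., 2023) or hierarchical clustering (Tay et al., 2022; Wang et al., 2022) creates an ID for the document from the embedding obtained in the first step. These methods typically suffer from information loss, and the embedding distribution produced by the text encoder does not match the distribution required for semantic indexing.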
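
The second step of this two-step process can be illustrated with a small residual-quantization sketch. It is a simplification under stated assumptions: the embeddings and codebooks below are hand-written toy values, whereas rq-VAE learns its codebooks, and `residual_quantize` is a hypothetical helper name, not an API from any of the cited systems.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def residual_quantize(embedding: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Map one embedding to a sequence of code indices, one per level."""
    residual = list(embedding)
    semantic_id: List[int] = []
    for codebook in codebooks:
        # Pick the codeword closest to what is still unexplained.
        index = min(
            range(len(codebook)),
            key=lambda i: squared_distance(residual, codebook[i]),
        )
        semantic_id.append(index)
        # Subtract it so the next level refines the remainder.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return semantic_id


# Hypothetical two-level codebooks: level 1 is coarse, level 2 is fine.
TOY_CODEBOOKS: List[List[Vector]] = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.2, 0.0], [-0.2, 0.0], [0.0, 0.2], [0.0, -0.2]],
]

# Hypothetical embeddings standing in for the step-one encoder output.
doc_a = [1.1, 0.05]
doc_b = [0.9, 0.0]
doc_c = [0.1, 1.2]

id_a = residual_quantize(doc_a, TOY_CODEBOOKS)
id_b = residual_quantize(doc_b, TOY_CODEBOOKS)
id_c = residual_quantize(doc_c, TOY_CODEBOOKS)
print(id_a, id_b, id_c)
```

Here `doc_a` and `doc_b` share the first code and differ in the second, which is the coarse-to-fine behaviour hierarchical IDs are meant to have. The IDs can only be as good as the embeddings from step one, which is the limitation described above.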