Background

The first step a RAG system takes after ingesting a document is to split it into chunks, embed each chunk, and store the resulting vectors in a vector store.

The simplest, most brute-force approach is to fix a character length and cut the text into equal-sized chunks, as in this LangChain example:

from langchain.text_splitter import CharacterTextSplitter

text = "This is the text I would like to chunk up. It is the example text for this exercise"
text_splitter = CharacterTextSplitter(
    chunk_size=35,
    chunk_overlap=0,
    separator='',
    strip_whitespace=False,
)
text_splitter.create_documents([text])
[Document(page_content='This is the text I would like to ch'),
 Document(page_content='unk up. It is the example text for '),
 Document(page_content='this exercise')]

But this approach has several problems:

  1. It ignores semantic structure such as line breaks, question marks, and periods. Cutting a complete sentence apart destroys its structure and meaning, which hurts both retrieval quality and the LLM's in-context learning downstream.
  2. The chunk size and overlap are magic numbers; it is hard to pick good values.
  3. It adapts poorly. Different content naturally calls for different lengths rather than one fixed size: a description of a pen and a description of a computer obviously need different amounts of text.

Several optimization ideas address these problems:

Optimization ideas

Consider punctuation and other separators when splitting

For example, LangChain's RecursiveCharacterTextSplitter implements this: within the chunk-size limit, it splits the text again on separators such as paragraph breaks, line breaks, and spaces.

from langchain.text_splitter import RecursiveCharacterTextSplitter
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""
text_splitter = RecursiveCharacterTextSplitter(chunk_size=65, chunk_overlap=0)
text_splitter.create_documents([text])
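The recursive idea can be sketched in plain Python. This is a toy reimplementation for illustration only, not LangChain's actual code: try the coarsest separator first, and for any piece that is still over the size limit, fall back to the next, finer separator.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Toy recursive splitter: try the coarsest separator first and
    fall back to finer ones for any piece still over chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep) if sep else list(text)
    chunks, buf = [], ""
    for part in parts:
        candidate = (buf + sep + part) if buf else part
        if len(candidate) <= chunk_size:
            buf = candidate  # still fits: keep accumulating
        else:
            if buf:
                chunks.append(buf)
                buf = ""
            if len(part) > chunk_size and rest:
                # This piece alone is too big: recurse with finer separators.
                chunks.extend(recursive_split(part, chunk_size, rest))
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks

print(recursive_split("aaa bbb ccc ddd", 7))  # ['aaa bbb', 'ccc ddd']
```

Because whole words (and, at coarser levels, whole paragraphs) are kept together whenever they fit, the cuts land on natural boundaries instead of mid-sentence.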

Decide split points by semantic similarity

After first splitting the text into short sentences on punctuation, compare the semantic similarity of adjacent sentence embeddings: if the similarity is high, merge the two sentences into one chunk before storing it in the vector store; if the similarity is low, keep them apart.
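The merge logic can be sketched as follows. To stay self-contained, this sketch uses a deliberately crude bag-of-words "embedding" (the `embed` function, the sample sentences, and the 0.3 threshold are all illustrative assumptions); a real pipeline would call an actual embedding model instead.

```python
from collections import Counter
import math

def embed(sentence):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    # Only meant to make the merge logic below runnable.
    return Counter(sentence.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_merge(sentences, threshold=0.3):
    # Merge each sentence into the current chunk while its embedding is
    # similar enough to the previous sentence; otherwise start a new chunk.
    chunks = [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine_similarity(prev_vec, vec) >= threshold:
            chunks[-1] = chunks[-1] + " " + sent
        else:
            chunks.append(sent)
        prev_vec = vec
    return chunks

sentences = [
    "The pen writes in blue ink.",
    "The pen uses blue ink.",
    "Laptops need regular software updates.",
]
print(semantic_merge(sentences))
# The two pen sentences merge into one chunk; the laptop sentence starts a new one.
```

The threshold is the key knob: too high and everything fragments, too low and unrelated topics get merged.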

LangChain's semantic splitting implementation

It uses cosine similarity to measure the distance between embedding vectors: the higher the similarity, the closer the two vectors.
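As a quick illustration of the metric itself (using numpy, not LangChain's internals): cosine similarity is the dot product of two vectors divided by the product of their norms, and one minus the similarity gives a distance.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 1.0])   # same direction as v1
v3 = np.array([0.0, 1.0, 0.0])   # orthogonal to v1

print(cosine_similarity(v1, v2))  # 1.0 -> distance 1 - 1.0 = 0.0
print(cosine_similarity(v1, v3))  # 0.0 -> distance 1 - 0.0 = 1.0
```

Sentences whose embeddings point in nearly the same direction get a distance near 0 and are merged; a spike in distance marks a topic change and hence a split point.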