The first step for a RAG system, once it has ingested a document, is to split the document into chunks, embed each chunk, and store the resulting vectors in a vector store.
The simplest brute-force approach is to pick a fixed character length and cut the text into fixed-size chunks, as in the following LangChain example:
from langchain.text_splitter import CharacterTextSplitter
text = "This is the text I would like to chunk up. It is the example text for this exercise"
text_splitter = CharacterTextSplitter(
    chunk_size=35,
    chunk_overlap=0,
    separator='',
    strip_whitespace=False,
)
text_splitter.create_documents([text])
[Document(page_content='This is the text I would like to ch'),
Document(page_content='unk up. It is the example text for '),
Document(page_content='this exercise')]
But this approach has a few problems: chunks can be cut in the middle of a word or sentence (as in the output above, where "chunk" is split into "ch" and "unk"), and the chunk boundaries ignore the semantic structure of the text, so related content ends up scattered across chunks.
To address these problems, there are a few ways to improve the splitting:
For example, LangChain's RecursiveCharacterTextSplitter tries a hierarchy of separators (paragraph breaks, then newlines, then spaces) and splits recursively, so each chunk stays within the chunk size while breaking at natural text boundaries instead of mid-word:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.
Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.
It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""
text_splitter = RecursiveCharacterTextSplitter(chunk_size=65, chunk_overlap=0)
text_splitter.create_documents([text])
Another idea is semantic chunking: first split the text into short sentences on punctuation, then embed each sentence and compare the semantic similarity of adjacent sentences. If the similarity is high, merge the two sentences into a single chunk before storing it in the vector store; if it is low, keep them as separate chunks.
Cosine similarity between the embedding vectors is used to measure this: the higher the similarity, the closer the two vectors are in embedding space.
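A minimal sketch of this merge step is below. The `embed` function and the 0.8 threshold are stand-ins: in practice you would call a real embedding model and tune the threshold on your data; here toy 2-D vectors play the role of embeddings.

```python
import math

def cosine_similarity(a, b):
    # cos similarity = (a . b) / (|a| * |b|); 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_merge(sentences, embed, threshold=0.8):
    """Greedily merge adjacent sentences whose embeddings are similar."""
    chunks = [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine_similarity(prev_vec, vec) >= threshold:
            chunks[-1] += " " + sent   # similar: extend the current chunk
        else:
            chunks.append(sent)        # dissimilar: start a new chunk
        prev_vec = vec
    return chunks

# Toy embeddings standing in for a real embedding model.
toy_vectors = {
    "Cats purr.":    [1.0, 0.0],
    "Kittens meow.": [0.9, 0.1],
    "Stocks fell.":  [0.0, 1.0],
}
chunks = semantic_merge(list(toy_vectors), toy_vectors.get)
print(chunks)  # ['Cats purr. Kittens meow.', 'Stocks fell.']
```

The first two sentences point in nearly the same direction (cosine similarity about 0.99), so they merge into one chunk; the third is nearly orthogonal to them and starts a new chunk.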