Why is RecursiveCharacterTextSplitter not giving any chunk overlap?

Question

I am trying to create chunks of at most 350 characters with a 100-character overlap between consecutive chunks.

I understand that chunk_size is an upper limit, so I may get chunks shorter than that. But why am I not getting any chunk_overlap?

Is it because the overlap also has to start on one of the separators? In other words, do I only get a 100-character chunk_overlap if there is a separator within 100 characters of the split point that it can split on?

from langchain.text_splitter import RecursiveCharacterTextSplitter

some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=350,
    chunk_overlap=100,
    # Raw string so the regex backslash is not treated as a string escape.
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)
x = r_splitter.split_text(some_text)
print(x)
for thing in x:
    print(len(thing))

Output

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.", 
'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
248
243

Answer 1

Score: 1

I found that RecursiveCharacterTextSplitter will not add overlap between chunks that are split apart by a separator, as in your setup:
separators=["\n\n", "\n", "(?<=\. )", " ", ""]

What's happening is that each of your two paragraphs becomes its own whole chunk because of the \n\n separator. These chunks are therefore treated as separate, and no overlap is generated between them. If you had a paragraph longer than your chunk size of 350 (or if your chunk size were smaller), that paragraph would be split into multiple chunks, and those chunks would overlap.
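To make that concrete, here is a minimal sketch of the merging behaviour described above. It is a simplified model written for illustration, not LangChain's actual code: merge_pieces and its joiner parameter are hypothetical names, and the sketch assumes that the overlap is built only from whole pieces that fit inside chunk_overlap.

```python
def merge_pieces(pieces, chunk_size, chunk_overlap, joiner=" "):
    """Greedily pack whole pieces into chunks of at most chunk_size characters.

    Simplified model (an assumption, not the library's implementation): the
    overlap carried into the next chunk consists only of whole trailing
    pieces whose joined length fits within chunk_overlap.
    """
    chunks, current = [], []
    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full, so emit it.
            chunks.append(joiner.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and still leaves room for the incoming piece.
            while current and (
                len(joiner.join(current)) > chunk_overlap
                or len(joiner.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


# Two paragraph-sized pieces, like the 248- and 243-character chunks in the
# question: neither fits inside a 100-character overlap, so nothing is shared.
para_1, para_2 = "x" * 248, "y" * 243
print([len(c) for c in merge_pieces([para_1, para_2], 350, 100, joiner="\n\n")])

# Small pieces: the trailing piece of each chunk is repeated in the next one.
print(merge_pieces(["aaaa", "bbbb", "cccc", "dddd"], chunk_size=9, chunk_overlap=4))
# -> ['aaaa bbbb', 'bbbb cccc', 'cccc dddd']
```

Under this model the overlap can never cut a piece in half, which would explain why two paragraphs of roughly 250 characters each share nothing when chunk_overlap is 100.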

I assume the package's logic is that, since you are purposefully semantically separating those paragraphs, you wouldn't want them to have their messages overlapped. If it is something you want, I'd recommend removing the relevant separators.

Note: My answer breaks down a little when you consider that " " is a separator as well. You'd think that would make each word its own chunk. I don't understand that part yet.
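One plausible explanation for the " " case, offered as a sketch rather than a description of the library's internals: the separators appear to be tried from coarsest to finest, and a finer separator is only used on a piece that is still larger than chunk_size, after which small pieces are merged back together up to the limit. The function below is a hypothetical illustration of that idea (recursive_split is not a LangChain function, and overlap is left out to keep it short).

```python
def recursive_split(text, separators, chunk_size):
    """Split on the coarsest separator, descending only for oversized pieces.

    Hypothetical sketch: pieces that already fit are merged back together up
    to chunk_size, so a fine separator such as " " does not by itself turn
    every word into its own chunk.
    """
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so try the next finer one.
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Too big even on its own: flush what we have, then go finer.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, finer, chunk_size))
        elif current and len(current) + len(sep) + len(piece) > chunk_size:
            # Adding this piece would overflow the chunk, so start a new one.
            chunks.append(current)
            current = piece
        else:
            current = piece if not current else current + sep + piece
    if current:
        chunks.append(current)
    return chunks


text = "one two three\n\nfour five six"

# Both paragraphs fit in 20 characters, so " " is never used.
print(recursive_split(text, ["\n\n", " "], chunk_size=20))
# -> ['one two three', 'four five six']

# With a limit of 8 the paragraphs are too long, so words are split and re-merged.
print(recursive_split(text, ["\n\n", " "], chunk_size=8))
# -> ['one two', 'three', 'four', 'five six']
```

In the question's example both paragraphs are shorter than 350 characters, so under this assumption the splitter never needs to go past the \n\n separator.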

Posted by huangapple on July 13, 2023, 23:56:54
When reposting, please keep the link to this article: https://go.coder-hub.com/76681318.html