

Why is RecursiveCharacterTextSplitter not giving any chunk overlap?






I am trying to create chunks of at most 350 characters with a chunk overlap of 100 characters.

I understand that chunk_size is an upper limit, so I may get chunks shorter than that. But why am I not getting any chunk_overlap?

Is it because the overlap also has to start at one of the separators? In other words, do I only get the 100-character chunk_overlap if there is a separator within 100 characters of the split point that it can split on?

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    some_text = """When writing documents, writers will use document structure to group content. \
    This can convey to the reader, which idea's are related. For example, closely related ideas \
    are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
    Paragraphs are often delimited with a carriage return or two carriage returns. \
    Carriage returns are the "backslash n" you see embedded in this string. \
    Sentences have a period at the end, but also, have a space.\
    and words are separated by space."""

    r_splitter = RecursiveCharacterTextSplitter(
        chunk_size=350,
        chunk_overlap=100,
        separators=["\n\n", "\n", "(?<=\. )", " ", ""]
    )
    x = r_splitter.split_text(some_text)
    print(x)
    for thing in x:
        print(len(thing))


    ["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
    'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
    248
    243


Score: 1



I found that RecursiveCharacterTextSplitter will not overlap chunks that are split by a separator, like how you have it:
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]

What's happening is that each of your two paragraphs is being made into its own whole chunk due to the \n\n separator. Thus these chunks are considered separate and will not generate overlap. If you had a paragraph that was greater than your 350 chunk size (or if your chunk size was smaller), the paragraph would get split into multiple chunks, and those chunks would have overlap.
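To make that concrete, here is a minimal pure-Python sketch of how I picture the merging step. It is an assumption about the behaviour, not LangChain's actual code: `merge_pieces` is a made-up helper that packs whole pieces into chunks and carries over only whole pieces as overlap, so a piece longer than chunk_overlap can never be repeated in the next chunk.

```python
def merge_pieces(pieces, chunk_size, chunk_overlap, joiner=" "):
    """Greedily pack whole pieces into chunks (hypothetical helper).

    Overlap is carried into the next chunk only as whole pieces, and only
    while those pieces fit inside the chunk_overlap budget.
    """
    chunks = []
    window = []  # pieces that make up the chunk currently being built

    def length(parts):
        return len(joiner.join(parts))

    for piece in pieces:
        if window and length(window + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(joiner.join(window))
            # Drop pieces from the front until the remainder fits in the
            # overlap budget and leaves room for the incoming piece.
            while window and (
                length(window) > chunk_overlap
                or length(window + [piece]) > chunk_size
            ):
                window.pop(0)
        window.append(piece)

    if window:
        chunks.append(joiner.join(window))
    return chunks


# Two paragraph-sized pieces, like the ones in the question.
paragraphs = ["a" * 248, "b" * 243]
print([len(c) for c in merge_pieces(paragraphs, 350, 100)])  # [248, 243]

# Word-sized pieces: these are small enough to be carried over as overlap.
words = "one two three four five six".split()
print(merge_pieces(words, 13, 5))  # ['one two three', 'three four', 'four five six']
```

In the first call neither 248-character nor 243-character piece fits into the 100-character overlap budget, so nothing is carried over. In the second call the pieces are single words, so "three" and "four" each reappear at the start of the next chunk.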

I assume the package's logic is that, since you are purposefully semantically separating those paragraphs, you wouldn't want them to have their messages overlapped. If it is something you want, I'd recommend removing the relevant separators.

Note: My answer breaks down a little when you consider that " " is a separator as well. You'd think that would make each word its own chunk. I don't understand that part yet.
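One possible explanation for the " " question, as far as I can tell: the separators appear to be tried in order, and a later separator only comes into play for a piece that is still larger than chunk_size. The sketch below is a simplified pure-Python model of that idea, not LangChain's code. `split_recursively` is a hypothetical helper, it treats every separator as a literal string (no regex), and it leaves out the step that merges small pieces back together into chunks.

```python
def split_recursively(text, separators, chunk_size):
    """Split on the first separator found in the text (hypothetical helper).

    Only pieces that are still longer than chunk_size are split again,
    using the separators that come later in the list.
    """
    for i, sep in enumerate(separators):
        if sep in text:  # note: "" is always "in" a string
            break
    else:
        return [text]  # no separator applies, keep the text whole

    # Splitting on "" means falling back to single characters.
    pieces = list(text) if sep == "" else [p for p in text.split(sep) if p]

    result = []
    for piece in pieces:
        remaining = separators[i + 1:]
        if len(piece) <= chunk_size or not remaining:
            result.append(piece)  # small enough: never touched by " "
        else:
            result.extend(split_recursively(piece, remaining, chunk_size))
    return result


text = "alpha beta gamma\n\ndelta epsilon"
seps = ["\n\n", "\n", " ", ""]

# Both paragraphs fit within chunk_size, so " " is never used.
print(split_recursively(text, seps, 50))  # ['alpha beta gamma', 'delta epsilon']

# Both paragraphs are too long, so each is split again on " ".
print(split_recursively(text, seps, 10))  # ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
```

Under this model, the paragraphs in the question (248 and 243 characters) are already below the 350-character limit after the "\n\n" split, which would explain why the " " separator never turns each word into its own chunk.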

  • Published on July 13, 2023 at 23:56:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76681318.html



