LangChain RecursiveCharacterTextSplitter ONLY by character (ignoring chunk size)

Question

I have a bunch of Terraform files that I want to load in LangChain and split with RecursiveCharacterTextSplitter, but I would like to ignore the chunk size and use ONLY the separators. For example, let's say I have this tf file:

    provider "google" {
      credentials = file("<YOUR_CREDENTIALS_JSON>")
      project = "<YOUR_PROJECT_ID>"
      region = "us-central1"
    }
    resource "google_storage_bucket" "my_bucket" {
      name = "my-bucket-name"
      location = "US"
    }
    resource "google_cloud_run_service" "default" {
      name = "my-cloudrun-service"
      location = "us-central1"
      template {
        spec {
          containers {
            image = "gcr.io/${var.project}/my-image:latest"
          }
        }
      }
      traffic {
        percent = 100
        latest_revision = true
      }
    }
    resource "google_cloud_run_service_iam_member" "public" {
      service = google_cloud_run_service.default.name
      location = google_cloud_run_service.default.location
      role = "roles/run.invoker"
      member = "allUsers"
    }
    output "cloud_run_url" {
      value = "${google_cloud_run_service.default.status[0].url}"
    }
    output "bucket_url" {
      value = "gs://${google_storage_bucket.my_bucket.name}/"
    }

What I want is to create three documents: one with the provider, one with one resource, and one with the other resource. Once I load the document, it would make sense to use something like:

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    terraform_separators = [
        # First, try to split along definitions
        "\n\nresource ",
        "\n\nmodule ",
        "\n\ndata ",
        "\n\nlocals ",
        "\n\nvariable ",  # the comma was missing here, which silently merged this entry with the next one
        "\n\noutput ",
        "\nresource ",
        "\nmodule ",
        "\ndata ",
        "\nlocals ",
        "\nvariable ",
        "\noutput ",
        # Now split by the normal type of lines
        "\n\n",
        "\n",
        " ",
        "",
    ]
    text_splitter = RecursiveCharacterTextSplitter(
        separators=terraform_separators, chunk_size=300, chunk_overlap=0
    )

The problem is that chunk_size takes precedence over the separators themselves. Is there a way to make the splitter honor only the separators?

Answer 1

Score: 1

You can inherit from the base class RecursiveCharacterTextSplitter and override its split_text method to implement your custom logic:

    from typing import List
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    class CustomClass(RecursiveCharacterTextSplitter):
        def split_text(self, text: str) -> List[str]:
            pass  # Your custom logic
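As a rough illustration of what that custom logic might look like, here is a minimal sketch that splits only at the start of top-level Terraform blocks and never looks at chunk length. It is plain Python with no LangChain dependency, so it can be tried on its own. The function name split_terraform_blocks and the BLOCK_KEYWORDS tuple are hypothetical names made up for this example, and "provider" is an addition that is not in the question's separator list. It also assumes that top-level blocks start at column 0 and nested lines are indented.

```python
import re
from typing import List

# Hypothetical keyword list for illustration. "provider" is an assumption:
# it does not appear in the separator list from the question.
BLOCK_KEYWORDS = (
    "provider", "resource", "module", "data", "locals", "variable", "output",
)

# Match one or more newlines only when the next line begins, at column 0,
# with a block keyword. The lookahead leaves the keyword in the next chunk.
_BLOCK_START = re.compile(r"\n+(?=(?:%s)\b)" % "|".join(BLOCK_KEYWORDS))


def split_terraform_blocks(text: str) -> List[str]:
    """Return one chunk per top-level block, whatever its length."""
    pieces = _BLOCK_START.split(text.strip())
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    sample = (
        'provider "google" {\n'
        '  region = "us-central1"\n'
        "}\n"
        'resource "google_storage_bucket" "my_bucket" {\n'
        '  name = "my-bucket-name"\n'
        "}\n"
    )
    for chunk in split_terraform_blocks(sample):
        print(chunk)
        print("-----")
```

Under these assumptions, the overridden split_text in the subclass above could simply return split_terraform_blocks(text), so the chunk_size passed to the constructor would no longer influence the result. Files whose top-level blocks are indented, or that use an attribute named like one of the keywords at column 0, would need a stricter pattern.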

huangapple
  • Published on August 8, 2023, 23:26:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76861020.html