LangChain RecursiveCharacterTextSplitter ONLY by character (ignoring chunk size)

Question

I have a bunch of Terraform files that I want to load in LangChain and split with RecursiveCharacterTextSplitter, but I would like to ignore the chunk size and use ONLY the separators. For example, let's say I have this .tf file:

provider "google" {
  credentials = file("<YOUR_CREDENTIALS_JSON>")
  project     = "<YOUR_PROJECT_ID>"
  region      = "us-central1"
}

resource "google_storage_bucket" "my_bucket" {
  name     = "my-bucket-name"
  location = "US"
}

resource "google_cloud_run_service" "default" {
  name     = "my-cloudrun-service"
  location = "us-central1"

  template {
    spec {
      containers {
        image = "gcr.io/${var.project}/my-image:latest"
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

resource "google_cloud_run_service_iam_member" "public" {
  service  = google_cloud_run_service.default.name
  location = google_cloud_run_service.default.location
  role     = "roles/run.invoker"
  member   = "allUsers"
}

output "cloud_run_url" {
  value = "${google_cloud_run_service.default.status[0].url}"
}

output "bucket_url" {
  value = "gs://${google_storage_bucket.my_bucket.name}/"
}

What I want is to create three documents: one with the provider, one with one resource, and another with the other resource. Once I load the document, it would make sense to use something like:

terraform_separators = [
    # First, try to split along definitions
    "\n\nresource ",
    "\n\nmodule ",
    "\n\ndata ",
    "\n\nlocals ",
    "\n\nvariable ",
    "\n\noutput ",
    "\nresource ",
    "\nmodule ",
    "\ndata ",
    "\nlocals ",
    "\nvariable ",
    "\noutput ",
    # Now split by the normal type of lines
    "\n\n",
    "\n",
    " ",
    "",
]

text_splitter = RecursiveCharacterTextSplitter(separators=terraform_separators, chunk_size=300, chunk_overlap=0)
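
For context, the splitter would then be applied to the loaded files roughly like this (a minimal sketch; DirectoryLoader, the ./terraform path, and the glob pattern are illustrative assumptions):

from langchain.document_loaders import DirectoryLoader, TextLoader

# Load every .tf file under the project directory as a plain-text Document
loader = DirectoryLoader("./terraform", glob="**/*.tf", loader_cls=TextLoader)
docs = loader.load()

# Apply the splitter defined above
chunks = text_splitter.split_documents(docs)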

The problem is that chunk_size takes precedence over the separators themselves. Is there a way to give priority to the separators only?

Answer 1

Score: 1

You can inherit from the base class RecursiveCharacterTextSplitter and override its split_text method to implement your custom logic:

class CustomClass(RecursiveCharacterTextSplitter):
    def split_text(self, text: str) -> List[str]:
        ...  # your custom logic goes here
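
For illustration, a more complete sketch of that idea follows: it overrides split_text so the text is split purely at top-level Terraform block keywords and the chunk size is never consulted. The class name, the regular expression, and the tf_text variable are assumptions, not part of the original answer.

import re
from typing import List

from langchain.text_splitter import RecursiveCharacterTextSplitter


class TerraformBlockSplitter(RecursiveCharacterTextSplitter):
    # Split at a newline that is immediately followed by a top-level block keyword.
    # Nested blocks are indented, so they do not match the lookahead.
    _BLOCK_RE = re.compile(
        r"\n(?=(?:provider|resource|module|data|locals|variable|output)\s)"
    )

    def split_text(self, text: str) -> List[str]:
        # Ignore chunk_size entirely: return one chunk per top-level block.
        return [chunk.strip() for chunk in self._BLOCK_RE.split(text) if chunk.strip()]


splitter = TerraformBlockSplitter()
documents = splitter.create_documents([tf_text])  # tf_text: contents of the .tf file

Because split_text is overridden, chunk_size, chunk_overlap, and the separators list on the base class are never used, so each provider, resource, and output block becomes its own document regardless of its length.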

huangapple
  • Published on 2023-08-08 23:26:36
  • Source: https://go.coder-hub.com/76861020.html