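Question

I have a bunch of Terraform files that I want to load in LangChain and split with RecursiveCharacterTextSplitter, but I would like to ignore the chunk size and use ONLY the separators. For example, say I have this tf file: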

provider "google" {
  credentials = file("<YOUR_CREDENTIALS_JSON>")
  project     = "<YOUR_PROJECT_ID>"
  region      = "us-central1"
}

resource "google_storage_bucket" "my_bucket" {
  name     = "my-bucket-name"
  location = "US"
}

resource "google_cloud_run_service" "default" {
  name     = "my-cloudrun-service"
  location = "us-central1"

  template {
    spec {
      containers {
        image = "gcr.io/${var.project}/my-image:latest"
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

resource "google_cloud_run_service_iam_member" "public" {
  service  = google_cloud_run_service.default.name
  location = google_cloud_run_service.default.location
  role     = "roles/run.invoker"
  member   = "allUsers"
}

output "cloud_run_url" {
  value = "${google_cloud_run_service.default.status[0].url}"
}

output "bucket_url" {
  value = "gs://${google_storage_bucket.my_bucket.name}/"
}

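What I want is to create three documents: one with the provider, one with one resource, and one with the other resource. Once the document is loaded, it would make sense to use something like: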

terraform_separators = [
    # First, try to split along definitions
    "\n\nresource ",
    "\n\nmodule ",
    "\n\ndata ",
    "\n\nlocals ",
    "\n\nvariable ",
    "\n\noutput ",
    "\nresource ",
    "\nmodule ",
    "\ndata ",
    "\nlocals ",
    "\nvariable ",
    "\noutput ",
    # Now split by the normal type of lines
    "\n\n",
    "\n",
    " ",
    "",
]

text_splitter = RecursiveCharacterTextSplitter(separators=terraform_separators, chunk_size=300, chunk_overlap=0)

The problem is that chunk_size takes precedence over the separators themselves. Is there a way to give priority to the separators only?

Answer 1

Score: 1

You can inherit from the base class RecursiveCharacterTextSplitter and override its split_text method to implement your custom logic.

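from typing import List
from langchain.text_splitter import RecursiveCharacterTextSplitter  # import path may vary by LangChain version

class CustomClass(RecursiveCharacterTextSplitter):
    def split_text(self, text: str) -> List[str]:
        pass  # your custom logic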
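
As one possible way to fill in that custom logic, here is a minimal sketch of separator-only splitting that ignores chunk size entirely. It is plain Python and does not call LangChain. The helper name split_on_definitions and the BLOCK_KEYWORDS tuple are hypothetical, with the keywords taken from the separators listed in the question. It assumes top-level Terraform blocks start at the beginning of a line, so indented nested blocks are never split.

```python
import re
from typing import List, Sequence

# Hypothetical keyword list, derived from the separators in the question.
BLOCK_KEYWORDS = ("resource", "module", "data", "locals", "variable", "output")


def split_on_definitions(text: str, keywords: Sequence[str] = BLOCK_KEYWORDS) -> List[str]:
    """Split before every line that starts with a block keyword, regardless of chunk length."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # Consume the newline(s) in front of a definition; the lookahead leaves
    # the keyword itself at the start of the following chunk.
    pattern = re.compile(rf"\n+(?=(?:{alternation})\b)")
    # Drop empty pieces and surrounding whitespace.
    return [chunk.strip() for chunk in pattern.split(text) if chunk.strip()]
```

A subclass like the one above could return split_on_definitions(text) from its overridden split_text. Note that with the keywords from the question, each output block also becomes its own chunk, so the sample file would yield more than three pieces unless the keyword list is narrowed.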
