LangChain RecursiveCharacterTextSplitter ONLY by character (ignoring chunk size)
Question
I have a bunch of Terraform files that I want to load in LangChain and split with RecursiveCharacterTextSplitter, but I would like to ignore the chunk size and use ONLY the separators. For example, say I have this tf file:
provider "google" {
  credentials = file("<YOUR_CREDENTIALS_JSON>")
  project = "<YOUR_PROJECT_ID>"
  region = "us-central1"
}

resource "google_storage_bucket" "my_bucket" {
  name = "my-bucket-name"
  location = "US"
}

resource "google_cloud_run_service" "default" {
  name = "my-cloudrun-service"
  location = "us-central1"
  template {
    spec {
      containers {
        image = "gcr.io/${var.project}/my-image:latest"
      }
    }
  }
  traffic {
    percent = 100
    latest_revision = true
  }
}

resource "google_cloud_run_service_iam_member" "public" {
  service = google_cloud_run_service.default.name
  location = google_cloud_run_service.default.location
  role = "roles/run.invoker"
  member = "allUsers"
}

output "cloud_run_url" {
  value = "${google_cloud_run_service.default.status[0].url}"
}

output "bucket_url" {
  value = "gs://${google_storage_bucket.my_bucket.name}/"
}
What I want is to create three documents: one with the provider, one with one resource, and one with the other resource. Once I load the document, it would make sense to use something like:
from langchain.text_splitter import RecursiveCharacterTextSplitter

terraform_separators = [
    # First, try to split along definitions
    "\n\nresource ",
    "\n\nmodule ",
    "\n\ndata ",
    "\n\nlocals ",
    "\n\nvariable ",  # comma added: without it this string merges with the next one
    "\n\noutput ",
    "\nresource ",
    "\nmodule ",
    "\ndata ",
    "\nlocals ",
    "\nvariable ",
    "\noutput ",
    # Now split by the normal type of lines
    "\n\n",
    "\n",
    " ",
    "",
]
text_splitter = RecursiveCharacterTextSplitter(
    separators=terraform_separators, chunk_size=300, chunk_overlap=0
)
The problem is that chunk_size takes priority over the separators themselves. Is there a way to make the splitter honor only the separators?
Answer 1
Score: 1
You can inherit from the base class RecursiveCharacterTextSplitter and override its split_text method to implement your custom logic:
from typing import List

from langchain.text_splitter import RecursiveCharacterTextSplitter

class CustomClass(RecursiveCharacterTextSplitter):
    def split_text(self, text: str) -> List[str]:
        pass  # Your custom logic
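As one possible body for that custom logic, here is a minimal sketch that splits purely on top-level block boundaries and never looks at a chunk size. It uses only the standard library, so it can be tried without LangChain installed. The function name split_terraform_blocks and the keyword list are illustrative assumptions, not part of LangChain, and the sketch assumes that top-level blocks start in column 0 while nested content is indented.

```python
import re
from typing import List

# Hypothetical keyword list for top-level Terraform blocks; extend as needed.
TOP_LEVEL_KEYWORDS = (
    "terraform", "provider", "resource", "module",
    "data", "locals", "variable", "output",
)

# Zero-width match at the start of any line that begins with a keyword.
# Because nothing is consumed, the keyword stays attached to its block.
# Assumption: nested lines are indented, so they never match.
_BLOCK_START = re.compile(r"(?m)^(?=(?:%s)\b)" % "|".join(TOP_LEVEL_KEYWORDS))


def split_terraform_blocks(text: str) -> List[str]:
    """Return one chunk per top-level block, regardless of chunk length."""
    pieces = (piece.strip() for piece in _BLOCK_START.split(text))
    # Drop empty pieces (for example the one before the very first block).
    # Any comments or other text ahead of the first block become their own chunk.
    return [piece for piece in pieces if piece]
```

In the subclass above, split_text could then simply return split_terraform_blocks(text), in which case chunk_size and separators would have no effect on the result. Splitting on empty matches with re.split requires Python 3.7 or later.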