Export 50 GB Elasticsearch indices to S3 as JSON/text

Question

I need to export a large number of Elasticsearch indices to S3 in JSON format, where each index is around 50 GB in size. I've been looking into a number of ways of doing this, but I need the most time-efficient method due to the size of the data.

I tried elasticdump, but from testing it out, I think it stores the whole index in memory before dumping it to S3 as a single file, so I'd need an EC2 instance with more than 50 GB of memory. Is there any way to get it to dump a series of smaller files instead of one huge file?

There are other options, such as Logstash, or Python's Elasticsearch library (possibly with its helpers) to do the operation.

What would be the best method for this?

Answer 1

Score: 1

ES to S3 Logstash Pipeline

To move raw JSON from Elasticsearch to an S3 bucket, you can use the s3 output in a Logstash pipeline. Here is an example pipeline to follow:

input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "myindex-*"
    query => '{ "query": { "match_all": {} } }'
  }
}

filter {
  # Your filter configuration here (optional; can be left empty to export documents as-is)
}

output {
  s3 {
    bucket => "BUCKET_NAME"
    region => "us-east-1"
    access_key_id => "ACCESS_KEY"
    secret_access_key => "SECRET_KEY"
    canned_acl => "private"
    prefix => "logs/"    # optional
    time_file => 5       # rotate and upload the temporary file every 5 minutes
    # Only one codec can be set per output; json_lines writes newline-delimited JSON documents.
    codec => json_lines
    # Alternative: write only the raw message field as plain text
    # codec => plain { format => "%{[message]}" }
  }
}

S3 output plugin - parameters

  • bucket: The name of the S3 bucket to save the data to.
  • region: The AWS region that the S3 bucket is located in.
  • access_key_id: The AWS access key ID with permission to write to the S3 bucket.
  • secret_access_key: The AWS secret access key associated with the access key ID.
  • prefix: A prefix to be added to the object key of the saved data.
  • time_file: How long to buffer data before rotating the temporary file and uploading it to S3. Note that the logstash-output-s3 plugin documents this value in minutes, so time_file => 5 uploads a new object roughly every 5 minutes (see the rotation sketch after this list).
  • codec: The codec used to encode the data before it is written. Only one codec can be set per output; this example uses json_lines to produce newline-delimited JSON, with plain (format => "%{[message]}") left as a commented-out alternative for plain-text output.
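
Since the question is specifically about avoiding one huge file per index, it's worth noting that the plugin can also rotate objects by size. Below is a minimal sketch of the output section using the plugin's size_file and rotation_strategy settings; BUCKET_NAME, the prefix, and the ~256 MB threshold are placeholder values to adjust for your setup.

output {
  s3 {
    bucket => "BUCKET_NAME"
    region => "us-east-1"
    canned_acl => "private"
    prefix => "exports/myindex/"
    rotation_strategy => "size_and_time"   # rotate when either limit is reached
    size_file => 268435456                 # ~256 MB per object (value is in bytes)
    time_file => 5                         # ...or every 5 minutes, whichever comes first
    codec => json_lines
  }
}

With size_and_time rotation, each uploaded object stays at a manageable size no matter how large the source index is.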

If you are running this pipeline on an ECS container or an EC2 instance, you don't need to provide ACCESS_KEY and SECRET_KEY; instead, for security reasons, create an IAM role with write access to the bucket and attach it to the ECS task or EC2 instance.
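
In that case the credential settings can simply be dropped from the s3 output, and the AWS SDK's default credential chain (environment variables, shared credentials file, or the attached instance profile / ECS task role) is used instead. A minimal sketch, assuming a role that allows s3:PutObject on the target bucket:

output {
  s3 {
    bucket => "BUCKET_NAME"    # placeholder
    region => "us-east-1"
    canned_acl => "private"
    prefix => "exports/"
    time_file => 5
    codec => json_lines
    # No access_key_id / secret_access_key: credentials are resolved from the
    # instance profile or task role attached to the EC2 instance or ECS task.
  }
}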
