Export 50 GB Elasticsearch indices to S3 as JSON/text
Question
I need to export a large number of Elasticsearch indices to S3 in JSON format, where each index is around 50 GB in size. I've been looking into a number of ways of doing this, but I need the most time-efficient method, due to the size of the data.
I tried elasticdump, but from testing it out, I think it stores the whole index in memory before dumping it as a file to S3. So I'd need an EC2 instance with memory in excess of 50 GB. Is there any way of getting it to dump a series of smaller files instead of one huge file?
There are other options, such as using Logstash, or Python's Elasticsearch library (possibly with its helpers), to do the operation.
What would be the best method for this?
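For reference, here is a rough sketch of the Python option mentioned above, using the elasticsearch scan helper and boto3; the index name, bucket, key prefix and chunk size are placeholders, and the idea is simply to stream documents out of Elasticsearch and upload them as a series of smaller newline-delimited JSON parts instead of one huge file:

import json

import boto3
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")
s3 = boto3.client("s3")

BUCKET = "my-export-bucket"   # placeholder bucket name
INDEX = "myindex"             # placeholder index name
DOCS_PER_FILE = 100_000       # tune so each part stays well under the instance's memory

buffer, part = [], 0
for hit in scan(es, index=INDEX, query={"query": {"match_all": {}}}):
    buffer.append(json.dumps(hit["_source"]))
    if len(buffer) >= DOCS_PER_FILE:
        # flush the current chunk as one S3 object and start a new one
        s3.put_object(
            Bucket=BUCKET,
            Key=f"exports/{INDEX}/part-{part:05d}.json",
            Body="\n".join(buffer).encode("utf-8"),
        )
        buffer, part = [], part + 1

if buffer:
    # flush the final partial chunk
    s3.put_object(
        Bucket=BUCKET,
        Key=f"exports/{INDEX}/part-{part:05d}.json",
        Body="\n".join(buffer).encode("utf-8"),
    )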
Answer 1
Score: 1
ES to S3 Logstash Pipeline
To move raw JSON from Elasticsearch to an S3 bucket, you can use the s3 output in a Logstash pipeline. Here is an example pipeline to follow:
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "myindex-*"
    query => '{ "query": { "match_all": {} } }'
  }
}
filter {
  # Your filter configuration here
}
output {
  s3 {
    bucket            => "BUCKET_NAME"
    region            => "us-east-1"
    access_key_id     => "ACCESS_KEY"
    secret_access_key => "SECRET_KEY"
    canned_acl        => "private"
    prefix            => "logs/"     # Optional
    time_file         => 5           # rotate the current file every 5 minutes
    codec             => json_lines  # one codec per output; use plain { format => "%{[message]}" } to write raw strings instead
  }
}
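Assuming the pipeline above is saved as something like es_to_s3.conf, it can be run with bin/logstash -f es_to_s3.conf from the Logstash installation directory; Logstash then scrolls through the index and uploads the rotated files to the bucket.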
S3 output plugin - parameters
- bucket: The name of the S3 bucket to save the data to.
- region: The AWS region that the S3 bucket is located in.
- access_key_id: The AWS access key ID with permission to write to the S3 bucket.
- secret_access_key: The AWS secret access key associated with the access key ID.
- prefix: A prefix to be added to the object key of the saved data.
- time_file: How long, in minutes, to keep writing to the current file before it is rotated out and uploaded to S3; for example, time_file => 5 rotates every five minutes (see the rotation sketch after this list).
- codec: The codec used to encode the saved data. Here json_lines writes each event as one JSON document per line; a plain codec with format => "%{[message]}" could be used instead to write the raw message strings.
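Because the goal is a series of smaller files rather than one huge object, the plugin's file-rotation settings are worth noting. Below is a minimal sketch of an alternative output section, assuming the size_file option of logstash-output-s3 (a rotation threshold in bytes) alongside time_file; the bucket name, prefix and thresholds are placeholders:

output {
  s3 {
    bucket    => "BUCKET_NAME"
    region    => "us-east-1"
    prefix    => "exports/myindex/"
    size_file => 1073741824   # start a new object once the current file reaches ~1 GB (placeholder threshold)
    time_file => 15           # also rotate every 15 minutes
    codec     => json_lines
  }
}

With the default rotation strategy, whichever of the two limits is hit first should trigger the upload, so a 50 GB index ends up in S3 as a series of roughly 1 GB objects rather than a single file.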
If you are running this pipeline in an ECS container or on EC2, you don't need to provide the ACCESS_KEY and SECRET_KEY; instead, you can create an IAM role with write access to the bucket and attach it to the ECS task or EC2 instance for security reasons.
Comments