S3 to AWS Kinesis to OpenSearch


Question


I have an S3 bucket with some 10M objects, all of which need to be processed and sent to OpenSearch. To this end I am evaluating whether Kinesis can be used for this.

The solutions online seem to imply using Lambda, but since I have 10M objects, I would think the function would time out before the for loop is exhausted.

So, the setup I would like is:

S3 --> (some producer) --> Kinesis Data Streams --> OpenSearch (destination)

What would be the optimal way to go about this, please?

Answer 1

Score: 2


Mark B's answer is definitely a viable option, and I'd suggest configuring your SQS queue to trigger Lambda for each message.

Unless you need Kinesis for some ETL functionality, it's likely that you can go from S3 to OpenSearch directly.

Assuming the docs in S3 are formatted suitably for OpenSearch, I would take one of the following approaches:

  1. AWS Step Functions has a built-in pattern to process items in S3. This would iterate over all the objects in a chosen bucket (or folder, etc.) that match your description. Each object could then be sent to a Lambda function to save its contents to OpenSearch (see the handler sketch just after this list).
    • Assuming you have some ETL or formatting requirements, this would be easy to implement in Lambda.
    • I can't find any documentation for the SFN S3 Patterns, but they're available in Workflow Studio, see this screenshot.
  2. If you're comfortable with Python, the AWS SDK for Pandas (previously AWS Data Wrangler) is a super easy option. I've used it extensively for moving data from CSVs, S3, and other locations into OpenSearch with ease.
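
For option 1, a rough sketch of the Lambda handler each iteration could invoke might look like the following. The event shape, the OpenSearch endpoint, and the index name are all placeholders/assumptions here (adjust them to whatever your state machine actually passes), and authentication is omitted:

import json

import boto3
from opensearchpy import OpenSearch

s3 = boto3.client("s3")

# placeholder endpoint; add your own auth/connection details
client = OpenSearch(hosts=["https://OPENSEARCH_ENDPOINT:443"])

def handler(event, context):
    # assumes the state machine passes {"bucket": ..., "key": ...} for each object
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    doc = json.loads(obj["Body"].read())

    # any ETL / reformatting would happen here before indexing
    client.index(index="my-index", body=doc, id=event["key"])
    return {"indexed": event["key"]}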

Using the AWS SDK for Pandas, you might achieve what you're looking for like this...

import awswrangler as wr
from opensearchpy import OpenSearch

# read every JSON document under the prefix into a single DataFrame
items = wr.s3.read_json(path="s3://my-bucket/my-folder/")

# connect + upload to OpenSearch
my_client = OpenSearch(...)
wr.opensearch.index_df(client=my_client, df=items, index="my-index")  # index name is a placeholder

The AWS SDK for Pandas can iterate over chunks of S3 items, and there's a tutorial on indexing JSON (and other file types) from S3 to OpenSearch.
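
For a bucket this size, a chunked variation of the snippet above keeps memory bounded. This sketch assumes the objects are JSON Lines files and relies on read_json's chunksize argument (passed through like in pandas), which returns an iterator of DataFrames instead of one large frame:

import awswrangler as wr
from opensearchpy import OpenSearch

my_client = OpenSearch(...)  # same connection details as above

# lines=True assumes JSON Lines input; chunksize is documents per batch, tune to taste
for chunk in wr.s3.read_json(path="s3://my-bucket/my-folder/", lines=True, chunksize=10000):
    wr.opensearch.index_df(client=my_client, df=chunk, index="my-index")  # placeholder index name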

Answer 2

Score: 1


Here's an official blog post on the subject that suggests using the DMS service to send existing files in S3 to Kinesis.

As for all the recommendations to use Lambda, those recommendations would be for the scenario where the files aren't in S3 yet, and Lambda would be triggered each time a file is uploaded to S3, processing just that one file. Nobody is recommending you use Lambda to process 10M existing S3 files in a single function invocation.

If you wanted to use Lambda for your current scenario, you could first create a list of all your S3 objects and write a one-time script that feeds that list of objects into an SQS queue. Then you could have a Lambda function that processes the messages in the queue over time, taking the object key from the queue, reading the file from S3, and sending it to Kinesis. Lambda could process those in batches of up to 10 at a time. Then you could configure the S3 bucket to send new-object notifications to the same SQS queue, and Lambda would automatically process any new objects that you add to the bucket later.
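
A minimal sketch of that queue-worker Lambda could look like the following. The bucket name, stream name, and the assumption that each SQS message body is simply an S3 object key (the format a one-time feeder script might use; real S3 event notifications wrap the key in JSON and would need a little extra parsing) are all placeholders, not something fixed by the approach above:

import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

BUCKET = "my-bucket"  # placeholder
STREAM = "my-stream"  # placeholder

def handler(event, context):
    # Lambda hands over SQS messages in batches (up to 10 per invocation here)
    for record in event["Records"]:
        key = record["body"]  # assumes the feeder script put plain object keys on the queue

        # read the file from S3...
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        payload = obj["Body"].read()

        # ...and forward it to Kinesis for delivery to OpenSearch downstream
        kinesis.put_record(StreamName=STREAM, Data=payload, PartitionKey=key)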
