DynamoDB Streams: small number of items per batch

Question

I have a very large DynamoDB table, and I want to use a Lambda function triggered by its stream. I would like to work in big batches of at least 1,000 items, but when I connect the Lambda, I see it invoked with tiny batches of 1 or 2 items. I increased the batch window to 15 seconds, and it doesn't help.

I assume it's because the table has a lot of shards, and every batch gathers items from only one shard. Is this correct?

What can be done in order to increase the batch size?

Answer 1

Score: 3

I wrote a deep-dive blog post about the integration of DynamoDB Streams and Lambda (disclaimer: written by me on the company blog, but very relevant to the question); the diagrams below are taken from there.

DynamoDB Streams consist of shards that store a record of changes sequentially. Each storage partition in the table maps to at least one shard of a DynamoDB stream. A shard gets split if it fills up or its throughput is too high.

[Diagram: table storage partitions mapping to DynamoDB stream shards, and shard splitting]

Conceptually, this is how the Lambda Service polls the stream shards:

[Diagram: the Lambda service polling the stream shards in parallel]

Crucially, polling the shards happens in parallel, but batching is always per shard in order to maintain the order of changes and have consistent scale-out behavior.

This diagram shows how the configuration options in the event source mapping influence how processing happens.

[Diagram: how the event source mapping configuration options influence batching and processing]

Let's focus on your situation. If you have a large number of items and relatively high throughput, chances are that DynamoDB allocates many storage partitions to handle that throughput. That automatically leads to a large number of stream shards (#shards >= #storage_partitions).
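
If you want to sanity-check this, you can count the shards on your table's stream with the DynamoDB Streams API. A minimal sketch (the stream ARN is a placeholder for your own):

```python
# Count the shards of a DynamoDB stream - a rough proxy for how
# fragmented your batches will be. Sketch only; STREAM_ARN is a placeholder.
import boto3

STREAM_ARN = "arn:aws:dynamodb:eu-west-1:123456789012:table/my-table/stream/2023-02-14T00:00:00.000"

streams = boto3.client("dynamodbstreams")

shard_count = 0
kwargs = {"StreamArn": STREAM_ARN}
while True:
    desc = streams.describe_stream(**kwargs)["StreamDescription"]
    shard_count += len(desc["Shards"])
    last = desc.get("LastEvaluatedShardId")
    if not last:  # no more pages of shards
        break
    kwargs["ExclusiveStartShardId"] = last

print(f"Stream has {shard_count} shards")
```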

If your changes are well distributed over the table (which is what you want in order to distribute the load evenly), this means there aren't many changes written to any single shard at any point in time. So for a batch window of a few seconds (15 in your case), the actual batch size may be low. If the changes are focused on a few partitions, you should see a relatively high variance in the batch size (unfortunately, there's no metric for it as far as I know).
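
One workaround is to emit that metric yourself: the batch size is just the length of the `Records` list the function receives. A minimal handler sketch, assuming you aggregate the logged values with something like CloudWatch Logs Insights:

```python
# Log the per-invocation batch size so you can observe its distribution.
import json

def handler(event, context):
    records = event.get("Records", [])
    # Structured log line; easy to aggregate later, e.g. in Logs Insights.
    print(json.dumps({"batch_size": len(records)}))
    # ... actual processing of the records goes here ...
```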

The only thing you can control directly here (without larger architectural changes) is the batch window. If you increase that, you should see larger batch sizes at the expense of higher processing latency.
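
For example, assuming you manage the event source mapping directly (the function name below is a placeholder), you could raise both limits like this; note that the batching window caps out at 300 seconds:

```python
# Sketch: raise the batch size and batch window on the existing
# event source mapping between the stream and the function.
import boto3

lambda_client = boto3.client("lambda")

# Find the event source mapping(s) for the function.
mappings = lambda_client.list_event_source_mappings(
    FunctionName="my-stream-processor"  # placeholder name
)["EventSourceMappings"]

for m in mappings:
    lambda_client.update_event_source_mapping(
        UUID=m["UUID"],
        BatchSize=1000,                      # upper bound; actual batches can be smaller
        MaximumBatchingWindowInSeconds=300,  # wait up to 5 minutes to fill a batch
    )
```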

You could consider having a Lambda function write these changes to a Kinesis Data Firehose delivery stream, configure it to write records in batches to S3, and have another Lambda respond to the objects written to S3. This would increase your latency again, but allows for much larger batch sizes.
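
A rough sketch of the forwarding half of that pattern, assuming a delivery stream named `ddb-changes` already exists (the name and the serialization format are placeholders):

```python
# Sketch: a Lambda on the DynamoDB stream forwards change records to a
# Kinesis Data Firehose delivery stream, which buffers them into larger
# objects on S3. DELIVERY_STREAM is a placeholder.
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "ddb-changes"  # placeholder

def handler(event, context):
    records = [
        {"Data": (json.dumps(r["dynamodb"]) + "\n").encode("utf-8")}
        for r in event.get("Records", [])
    ]
    # PutRecordBatch accepts at most 500 records per call; production code
    # should also retry anything reported in the response's FailedPutCount.
    for i in range(0, len(records), 500):
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records[i:i + 500],
        )
```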

(I also considered writing to SQS, but the max batch size you can request from there is 10.)
