DynamoDB Streams: small number of items per batch

Question


I have a very large DynamoDB table, and I want to use a Lambda function triggered by its stream. I would like to work in big batches of at least 1000 items. But when I connect the Lambda, I see it is invoked with tiny batches of 1 or 2 items. I increased the batch window to 15 seconds, and it doesn't help.

I assume it's because the table has a lot of shards, and every batch gathers items from only one shard. Is this correct?

What can be done in order to increase the batch size?

Answer 1

Score: 3

I wrote a deep-dive blog post about the integration of DynamoDB Streams and Lambda (disclaimer: written by me on the company blog, and very relevant to the question); the images below are taken from there.

DynamoDB Streams consist of shards that store a record of changes sequentially. Each storage partition in the table maps to at least one shard of a DynamoDB stream. The shards get split if a shard is full or the throughput is too high.

[Diagram from the blog post: table storage partitions mapping to stream shards]

Conceptually, this is how the Lambda Service polls the stream shards:

[Diagram from the blog post: the Lambda service polling the stream shards]

Crucially, polling the shards happens in parallel, but batching is always per shard in order to maintain the order of changes and have consistent scale-out behavior.

This diagram shows how the configuration options in the event source mapping influence how processing happens.

[Diagram from the blog post: event source mapping configuration options]

Let's focus on your situation. If you have a large number of items, and relatively high throughput, chances are that DynamoDB allocates many storage partitions to handle that throughput. That automatically leads to a large number of stream shards (#shards >= #storage_partitions).
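
One way to confirm that (a sketch, not from the original post) is to page through the stream's shards with the DynamoDB Streams API; a minimal boto3 example, where the stream ARN is a hypothetical placeholder:

```python
import boto3

streams = boto3.client("dynamodbstreams")

# Stream ARN of your table (hypothetical placeholder).
STREAM_ARN = "arn:aws:dynamodb:eu-west-1:123456789012:table/my-table/stream/2023-02-14T00:00:00.000"

shards = []
last_shard_id = None
while True:
    kwargs = {"StreamArn": STREAM_ARN}
    if last_shard_id:
        kwargs["ExclusiveStartShardId"] = last_shard_id
    description = streams.describe_stream(**kwargs)["StreamDescription"]
    shards.extend(description["Shards"])
    last_shard_id = description.get("LastEvaluatedShardId")
    if not last_shard_id:
        break

# Shards without an EndingSequenceNumber are still open for new records.
open_shards = [
    s for s in shards
    if "EndingSequenceNumber" not in s["SequenceNumberRange"]
]
print(f"{len(shards)} shards total, {len(open_shards)} currently open")
```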

If your changes are well distributed over the table (which is what you want, to distribute the load evenly), this means there aren't many changes written to any single shard at any point in time. So for a batch window of a few seconds (15 in your case), the actual batch size may be low. If the changes are focused on some partitions, you should see a relatively high variance in the batch size (unfortunately, there's no metric for it as far as I know).
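
Since there's no built-in metric, one workaround (just a sketch) is to log the batch size inside the function itself and inspect the distribution in CloudWatch Logs:

```python
import json

def handler(event, context):
    records = event.get("Records", [])
    # One structured log line per invocation; a CloudWatch Logs Insights
    # query over "batch_size" then shows the distribution over time.
    print(json.dumps({"batch_size": len(records)}))
    for record in records:
        pass  # your per-record processing goes here
```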

The only thing you can control directly here (without larger architectural changes) is the batch window. If you increase that, you should see larger batch sizes at the expense of higher processing latency.
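
For reference, both the batch size and the batch window live on the event source mapping; a sketch of raising them with boto3, where the mapping UUID is a placeholder (300 seconds is the maximum batch window Lambda accepts):

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",  # placeholder mapping UUID
    BatchSize=1000,                      # an upper bound per invocation, not a guarantee
    MaximumBatchingWindowInSeconds=300,  # the maximum Lambda allows
)
```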

You could consider having a Lambda function write these changes to a Kinesis Data Firehose delivery stream, configure it to write records in batches to S3, and have another Lambda respond to objects written to S3. This would increase your latency again, but it allows for much larger batch sizes.
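
A sketch of what the forwarding function could look like under that design; the delivery stream name is an assumption passed via an environment variable, and Firehose's PutRecordBatch accepts at most 500 records per call:

```python
import json
import os

import boto3

firehose = boto3.client("firehose")
# Name of the Firehose delivery stream (hypothetical).
DELIVERY_STREAM = os.environ["DELIVERY_STREAM_NAME"]

def handler(event, context):
    records = [
        {"Data": (json.dumps(r["dynamodb"]) + "\n").encode()}
        for r in event["Records"]
    ]
    # PutRecordBatch is limited to 500 records per call, so chunk the batch.
    for i in range(0, len(records), 500):
        response = firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records[i : i + 500],
        )
        if response["FailedPutCount"]:
            # A production version would retry the failed subset; kept minimal here.
            raise RuntimeError(f"{response['FailedPutCount']} records failed")
```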

(I also considered writing to SQS, but the max batch size you can request from there is 10.)

