Scheduled cleanup of DynamoDB after export to S3

Question

I have a DynamoDB table on which I need to perform these actions on a weekly/monthly basis:

  1. Export the data to S3
  2. Delete the exported data from DynamoDB

Use case: We have only 10% of traffic open and have 3k items and growing. We also need to give another account access to this data, and prefer not to give access to the table directly. To save retrieval time, to allow a different account to access the data, and because the data may not be used again in the near future, we are planning to move the data to S3.
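
For the cross-account part, a bucket policy on the export bucket is usually enough, since the other account only ever needs the exported objects. A minimal boto3 sketch, assuming hypothetical bucket and account-ID placeholders:

```python
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical names: replace with your export bucket and the
# consumer account's ID.
BUCKET = "my-ddb-export-bucket"
CONSUMER_ACCOUNT = "111122223333"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountList",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{CONSUMER_ACCOUNT}:root"},
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            "Sid": "AllowCrossAccountRead",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{CONSUMER_ACCOUNT}:root"},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

# The other account can now read the exported files without ever
# touching the DynamoDB table itself.
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```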

Options:

  1. Data Pipeline is too complex, and we don't wish to run an EMR cluster.
  2. Not going with Glue, since there is no analysis to be performed.
  3. AWS's built-in DynamoDB-to-S3 export

Planning for the S3 export (option 3) plus a Lambda to schedule the export and delete the DynamoDB records, driven by an EventBridge rule.
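
For reference, a minimal sketch of the export half of that Lambda, assuming the table ARN and bucket come in as environment variables (hypothetical names), and that point-in-time recovery is enabled on the table, which the native export requires:

```python
import os
from datetime import datetime, timezone

import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical names, supplied via the Lambda's environment.
TABLE_ARN = os.environ["TABLE_ARN"]
EXPORT_BUCKET = os.environ["EXPORT_BUCKET"]


def handler(event, context):
    """Invoked by an EventBridge schedule rule, e.g. cron(0 0 1 * ? *)."""
    now = datetime.now(timezone.utc)
    # The native export needs PITR enabled on the table; it runs
    # asynchronously and consumes no read capacity.
    response = dynamodb.export_table_to_point_in_time(
        TableArn=TABLE_ARN,
        S3Bucket=EXPORT_BUCKET,
        S3Prefix=now.strftime("exports/%Y-%m/"),
        ExportFormat="DYNAMODB_JSON",
    )
    # The export is async, so don't delete anything here; poll
    # describe_export(ExportArn=...) and run the deletes only once
    # the status is COMPLETED.
    return response["ExportDescription"]["ExportArn"]
```

Because the export is asynchronous, the delete step is safest as a second Lambda (or a Step Functions state) gated on the export reaching COMPLETED.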

Will this suffice, or is there a better approach? Please advise.

Answer 1

Score: 1

A few options to consider:

Evergreen tables pattern

  1. Create a new table each month and have your application write to the new table based on the current time
  2. When the new month comes, the old month's table can be exported to S3.
  3. Delete the old month's table once the export is done and you no longer need it

This one is probably the most cost-effective because you can better control how long items sit around. The biggest hassle is needing to provision new tables, update permissions, and have application logic that switches at the right time. Once it's up and running, it should be smooth, though. This is a really common pattern for folks using DDB for things like ML models, where they rotate tables regularly and don't want to pay for deleting all the old data. If you have strict SLAs on how long old data can be around, this might be the best option.
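
As a rough sketch of the application-side switch, assuming a hypothetical `events-YYYY-MM` naming scheme:

```python
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")


def current_table():
    # One table per calendar month, e.g. "events-2023-06"; the name
    # is derived from the clock, so the cutover happens on its own.
    suffix = datetime.now(timezone.utc).strftime("%Y-%m")
    return dynamodb.Table(f"events-{suffix}")


def put_item(item: dict):
    # Writes always land in the current month's table; last month's
    # table is now effectively read-only and can be exported to S3,
    # then dropped with a single DeleteTable call.
    current_table().put_item(Item=item)
```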

TTL pattern

  1. Set a TTL on all your data for the end of the month
  2. Export your data before the TTL window
  3. Let TTL expire the items

This has the issue that TTL can take a fairly long time (days) to clean up a lot of items, since it uses background WCUs, which means you pay for the storage a bit longer. The plus side is that it's cost-effective on WCUs. If you don't have a compliance need to get the data off DDB at a specific time, this works fine.
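
A minimal sketch of the write path, assuming a hypothetical table name and a TTL attribute called `expires_at` (the attribute name is whatever you pick when you enable TTL on the table via `update_time_to_live`):

```python
import calendar
from datetime import datetime, timezone

import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical name


def end_of_month_epoch() -> int:
    # Epoch seconds for the last second of the current month.
    now = datetime.now(timezone.utc)
    last_day = calendar.monthrange(now.year, now.month)[1]
    end = datetime(now.year, now.month, last_day, 23, 59, 59,
                   tzinfo=timezone.utc)
    return int(end.timestamp())


def put_item(item: dict):
    # DynamoDB deletes the item some time (possibly days) after the
    # epoch-seconds value in "expires_at" has passed, using
    # background capacity rather than your provisioned WCUs.
    item["expires_at"] = end_of_month_epoch()
    table.put_item(Item=item)
```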

Glue scan and delete pattern

I say use Glue, but really it's just that Spark-like tools are pretty effective at doing stuff like this, even when it isn't analytics. You could also make it work with something like Step Functions, if you'd rather do that.

  1. Kick off the export
  2. Use the exported data in Glue to have Glue kick off deletes against DDB

This has the downside of being fairly expensive (you have to provision extra WCUs to handle the deletes). It's fairly simple from your application's perspective, though. If you can't change application logic (to set a TTL or switch which table is being written to), I'd go with this option.
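
The delete step doesn't strictly require Glue; here's the same idea as a plain boto3 sketch (in Glue/Spark you'd parallelize this loop), assuming a DYNAMODB_JSON export and hypothetical bucket, prefix, table, and key names:

```python
import gzip
import json

import boto3

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")

# Hypothetical names; the export writes its data files under
# <prefix>/AWSDynamoDB/<exportId>/data/*.json.gz
BUCKET = "my-ddb-export-bucket"
PREFIX = "exports/2023-06/"
TABLE = "my-table"
KEY_ATTRS = ["pk"]  # your table's key attribute name(s)


def flush(batch):
    # Production code should retry anything that comes back in
    # UnprocessedItems rather than ignoring it.
    if batch:
        ddb.batch_write_item(RequestItems={TABLE: batch})


def delete_exported_items():
    batch = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".json.gz"):
                continue
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
            # Each line of a DYNAMODB_JSON export is {"Item": {...}}
            # with attributes already in DynamoDB's wire format, so
            # the key can be passed straight to DeleteRequest.
            for line in gzip.GzipFile(fileobj=body):
                item = json.loads(line)["Item"]
                batch.append({"DeleteRequest": {
                    "Key": {attr: item[attr] for attr in KEY_ATTRS}}})
                if len(batch) == 25:  # batch_write_item maximum
                    flush(batch)
                    batch = []
    flush(batch)
```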

Answer 2

Score: 0

You can use https://www.npmjs.com/package/dynoport to export data from DynamoDB in a high-performance way and write it to S3 on a schedule with an ECS cron task.
