How to retrieve new entries from an S3 bucket

Question

I have an Amazon S3 bucket.

Every minute or so, a new file is added to that bucket.

I need a program to retrieve those files (into program memory, not onto disk) for processing.

The program is launched manually, and on each launch it should process only files that have not been processed before.

Is there a function somewhere in the S3 SDK that lets me write it as easily as

var newFiles = s3client.GetFilesAfter(timestamp);

or will I have to write a more involved solution?

EDIT: I tagged this C#, but if the API is markedly different in other languages, I am open to solutions in those languages.

Answer 1

Score: 2

Amazon S3 Event Notifications can be automatically triggered when objects are created/modified/deleted.

The event can:

  • Send a message to an Amazon Simple Notification Service topic
  • Push a message into an Amazon Simple Queue Service queue
  • Invoke an AWS Lambda function

The easiest method would be to have the event invoke an AWS Lambda function that runs your code (assuming the code can be adapted to run in Lambda). The function is passed the Bucket and Key of the object that triggered it, so your code will likely process just that one object, but it will do so immediately after the object is created.
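As a rough illustration of the Lambda route, here is a minimal Python sketch of a handler that pulls the bucket and key out of the S3 event and reads the object into memory. It assumes the standard S3 notification event layout and the boto3 library that ships with the Lambda Python runtime; `object_refs` and `process` are hypothetical names invented for this example, and `process` only stands in for your own logic.

```python
import urllib.parse


def object_refs(event):
    """Return (bucket, key) pairs for every record in an S3 notification event."""
    refs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys arrive URL-encoded in the event (for example, spaces become "+").
        key = urllib.parse.unquote_plus(s3_info["object"]["key"])
        refs.append((bucket, key))
    return refs


def process(bucket, key, data):
    """Hypothetical placeholder for your own processing logic."""
    print(f"processing s3://{bucket}/{key} ({len(data)} bytes)")


def lambda_handler(event, context):
    import boto3  # assumed to be available, as in the Lambda Python runtime

    s3 = boto3.client("s3")
    refs = object_refs(event)
    for bucket, key in refs:
        # Read the whole object into memory; nothing is written to disk.
        data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        process(bucket, key, data)
    return {"processed": len(refs)}
```

Keeping the event parsing in its own function means it can be checked locally with a hand-written event dictionary, without any AWS access.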

Alternatively, if you want to keep running your existing code, it will need to:

  • List the entire contents of the bucket
  • Compare each object's LastModified timestamp against the time of the previous run
  • Determine which objects have been created/modified since then
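The list-and-compare steps could look roughly like the Python sketch below. It is only an illustration under some assumptions: it uses boto3 (a third-party package), `keys_modified_after` and `fetch_new_objects` are hypothetical names made up for this example, and `since` must be a timezone-aware datetime because boto3 reports LastModified as timezone-aware values. The program is also assumed to store the time of its previous run somewhere itself, for example in a small local file.

```python
from datetime import datetime, timezone


def keys_modified_after(pages, since):
    """Collect keys whose LastModified is later than `since`.

    `pages` is any iterable of ListObjectsV2 response dictionaries.
    """
    new_keys = []
    for page in pages:
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            if obj["LastModified"] > since:
                new_keys.append(obj["Key"])
    return new_keys


def fetch_new_objects(bucket, since):
    """Return {key: bytes} for objects in `bucket` modified after `since`."""
    import boto3  # third-party: pip install boto3

    s3 = boto3.client("s3")
    # The paginator follows continuation tokens (at most 1,000 keys per request).
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket)
    return {
        # Each object body is read straight into memory, not to disk.
        key: s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for key in keys_modified_after(pages, since)
    }
```

A call such as `fetch_new_objects("my-bucket", datetime(2023, 5, 30, tzinfo=timezone.utc))` (the bucket name is a placeholder) would still list every key in the bucket on each launch, because the comparison happens on the client.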

Answer 2

Score: 1

You could use the AWS CLI to list the objects modified after a certain timestamp by applying the --query parameter, which filters the listing on the client side:

aws s3api list-objects-v2 --bucket "$bucket" \
    --query 'Contents[?LastModified > `2023-05-30`]' 

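The challenge is that each underlying ListObjectsV2 request returns at most 1,000 keys together with a continuation token, so the listing has to be iterated until the last key. The filter is also applied only after the keys have been listed, so you might run into performance issues if the bucket holds a huge number of keys.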

The lightweight approach would be to use S3 Event Notifications, which can trigger the subsequent processing.

You can create any kind of event notification trigger described in the AWS documentation pages "Amazon S3 Event Notifications" and "Event notification types and destinations".

You can choose to either:

  • send all events to a message queue or topic (e.g. SQS/SNS) and configure that to trigger a Lambda function
  • or invoke a Lambda function directly from the S3 events.
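Since the program in the question is launched manually, a variation on the queue option is to let the program itself read the queue on each launch instead of attaching a Lambda function. The Python sketch below is one possible shape for that, not a definitive implementation. It assumes S3 publishes notifications directly to an SQS queue (events relayed through SNS arrive wrapped in an extra envelope that this code does not unwrap), it uses boto3, and `keys_from_message` and `drain_queue` are hypothetical names made up for this example.

```python
import json
import urllib.parse


def keys_from_message(body):
    """Extract (bucket, key) pairs from one SQS message body sent by S3."""
    payload = json.loads(body)
    # The "s3:TestEvent" message that S3 sends on setup has no "Records".
    return [
        (
            record["s3"]["bucket"]["name"],
            # Keys are URL-encoded in the notification.
            urllib.parse.unquote_plus(record["s3"]["object"]["key"]),
        )
        for record in payload.get("Records", [])
    ]


def drain_queue(queue_url):
    """Return {(bucket, key): bytes} for every object announced in the queue."""
    import boto3  # third-party: pip install boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    objects = {}
    while True:
        response = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=2
        )
        messages = response.get("Messages", [])
        if not messages:
            return objects  # nothing was returned by this poll
        for message in messages:
            for bucket, key in keys_from_message(message["Body"]):
                body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                objects[(bucket, key)] = body  # kept in memory only
            # Delete the message only after its objects were fetched.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
```

With this arrangement the queue acts as the record of unprocessed files, so the program does not have to list the bucket or remember a timestamp between runs.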

Examples:
https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html

huangapple
  • Posted on June 1, 2023 at 18:04:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76380790.html