July 11, 2023, 07:38:31

How can I keep track of which files uploaded to S3 are still being used?

Question

I am building an online editing tool. This tool allows users to upload different types of media and files, which are stored in S3. I'm struggling to think of a robust and future-proof way to manage these files, particularly when they are no longer being used. Currently, I am using Postgres for my database. I have a table that stores all the different elements that could be used in the editor. It has regular fields that hold ids, foreign keys and other data common to all the elements. There is also a JSONB field that stores all the data unique to an element. Every element type will have a different JSON structure, and as the application grows and more element types are added, there will be even more variations in structure. File locations/keys can be nested anywhere in the JSON object.

Here's an example from my application: a user uploads an image, but then decides not to use it and deletes the element that contains the image. I could remove the image from S3 at that moment without any issues. However, if the user changes their mind and undoes the deletion, the image will no longer exist in S3. One potential solution could be to store the image locally and then re-upload it. While this might be a good solution, it seems complex to implement and could result in latency and redundant file uploads. If I don't delete the file, and the user deletes the element and never undoes that action, the file is orphaned in S3: my application has no way of knowing that it isn't being used, or that it even exists.

I'm curious about how applications like Figma handle this problem.

I've considered various solutions, but I'm not satisfied with the one I'm currently using. In my current approach, every element in the database has a field containing an array of all the keys that have been uploaded. When a file is uploaded, its key is added to this array, and if it is deleted, the key is removed. Once the element is saved, the backend checks for changes in this array. If any keys have been removed, a record with those keys is created so that a job can be run later to delete all the unused files if they haven't been accessed for over a week. I believe this approach is quite fragile because every time a new element type is added, we need to remember to add and remove uploaded file keys, which could easily be forgotten. There are numerous touchpoints and processes that need to be followed for this approach to work properly. I need a simple and robust way to manage these uploaded files so that unnecessary files don't accumulate in S3.

Please let me know if you need any clarifications. Thanks.

Answer 1

Score: 1

This looks like a good use case for S3 Lifecycle policies combined with object tagging. When a user uploads an object to the canvas/element, upload it to S3 and tag it in the same PutObject request (say, Referenced = True), and keep storing the JSON as you do now. If the user removes the element, re-tag the object through the object tagging API (say, Referenced = False); if they undo the removal, set the tag back to Referenced = True.

Then, based on your application's requirements, set up an S3 lifecycle policy that deletes every object in the bucket tagged Referenced = False once it is 7 days old. Note that S3 counts those days from the object's creation, not from the moment the tag changed.

Here's a sample S3 Lifecycle configuration you can set up on your bucket:


<LifecycleConfiguration>
  <Rule>
    <ID>Rule 1</ID>
    <Filter>
      <Tag>
        <Key>Referenced</Key>
        <Value>False</Value>
      </Tag>
    </Filter>
    <Status>Enabled</Status>
    <Expiration>
      <Days>7</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>

With this setup, you no longer need to maintain a separate array/state for files to be deleted; the deletion itself is delegated to S3 lifecycle management.
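The tagging calls described in this answer could look roughly like the sketch below. It assumes `s3_client` is a boto3 S3 client (for example `boto3.client("s3")`); the helper names `upload_referenced` and `set_referenced` are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlencode

TAG_KEY = "Referenced"  # tag name used in the lifecycle rule above


def upload_referenced(s3_client, bucket, key, body):
    """Upload an object and tag it Referenced=True in the same request."""
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        # PutObject takes its tags as a URL-encoded query string.
        Tagging=urlencode({TAG_KEY: "True"}),
    )


def set_referenced(s3_client, bucket, key, referenced):
    """Re-tag an existing object: False when its element is deleted, True on undo."""
    value = "True" if referenced else "False"
    s3_client.put_object_tagging(
        Bucket=bucket,
        Key=key,
        # This call replaces the object's entire tag set, so any other
        # tags on the object would have to be included here as well.
        Tagging={"TagSet": [{"Key": TAG_KEY, "Value": value}]},
    )
```

Passing the client in as an argument keeps the helpers easy to exercise with a stub, which is useful because the application still has to call `set_referenced` at every delete and undo.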

Answer 2

Score: 1

It might be easier to:

- Perform a weekly or monthly scan of the data to make a list of all referenced S3 objects.
- Then, delete any objects older than a month if they are not referenced in the list.
  - Or, keep the list and compare it to the next list that is produced. Only delete objects if they are not referenced on both lists (this week/month and last week/month).
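The scan and the two-list rule can be sketched as follows. This is only an illustration: it assumes uploaded file keys can be recognised by a common prefix (the hypothetical `uploads/`), and that each element's JSONB value has already been loaded into Python dicts and lists.

```python
def collect_keys(value, prefix="uploads/"):
    """Recursively gather every string under `value` that looks like an S3 key."""
    found = set()
    if isinstance(value, str):
        if value.startswith(prefix):
            found.add(value)
    elif isinstance(value, dict):
        for child in value.values():
            found |= collect_keys(child, prefix)
    elif isinstance(value, list):
        for child in value:
            found |= collect_keys(child, prefix)
    return found


def referenced_keys(documents, prefix="uploads/"):
    """Build the list of referenced keys for one scan over all element documents."""
    keys = set()
    for doc in documents:
        keys |= collect_keys(doc, prefix)
    return keys


def safe_to_delete(all_keys, previous_scan, current_scan):
    """Keys that are referenced in neither the previous nor the current scan."""
    return set(all_keys) - previous_scan - current_scan
```

Because the walk does not depend on where a key sits in the JSON, a new element type needs no extra bookkeeping as long as its keys follow the same naming convention.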

You can obtain a list of objects currently in the bucket by using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. This way, you don't need to scan all the objects in S3 -- just compare the Inventory report against the list of referenced objects.
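A minimal sketch of that comparison is shown below. The assumptions are loud ones: the column order `Bucket, Key, LastModifiedDate` is hypothetical (the real order depends on how the inventory is configured), the key column is assumed to be URL-encoded, and the timestamp format is assumed to look like `2023-01-01T00:00:00.000Z`. The function name `stale_unreferenced` is illustrative.

```python
import csv
import io
from datetime import datetime, timedelta, timezone
from urllib.parse import unquote


def stale_unreferenced(inventory_csv, referenced, now, min_age_days=30,
                       columns=("Bucket", "Key", "LastModifiedDate")):
    """Return keys from the report that are unreferenced and older than min_age_days."""
    cutoff = now - timedelta(days=min_age_days)
    key_col = columns.index("Key")
    date_col = columns.index("LastModifiedDate")
    stale = []
    for row in csv.reader(io.StringIO(inventory_csv)):
        if not row:
            continue  # skip blank lines
        key = unquote(row[key_col])  # assumed URL-encoded in the report
        modified = datetime.strptime(
            row[date_col], "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed timestamp format
        ).replace(tzinfo=timezone.utc)
        if key not in referenced and modified < cutoff:
            stale.append(key)
    return stale
```

The returned keys are only deletion candidates; the minimum age keeps recently uploaded files out of the result, so an upload that has not yet been saved into an element is not removed.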

The cost of storing "unreferenced" objects in S3 is not high, so there is no urgency to delete objects. Waiting a month to delete them will not be a large cost burden (compared with never deleting objects).
