How to share data between two Sidekiq workers
Question
Imagine there are billions of users, and each user can have multiple email IDs. The task is to send Happy New Year greetings to every user in minimal time.
There are two Sidekiq workers:
One worker fetches the user's email IDs.
The other worker only sends email notifications.
The idea is to achieve parallelism: one worker only fetches user emails, and the other only sends them.
Is there a way to achieve this using Sidekiq?
Answer 1
Score: 2
There are indeed a few optimizations you can make.
1. Get the actual information you need instead of the user ID, so the sending worker (step 2) doesn't need to hit the database at all. For a New Year's greeting, you will likely need the user's name (at least the first name) and their email address. The user ID alone is probably not helpful, as you'd need a database query to look up the user and get the information you actually need. It is likely still useful to include it for other purposes, though (see #3 below).
2. Pre-schedule as many emails as possible. Everything in step 1 can be done well in advance of when you actually need to send the emails. However, some users may be missed (likely a very small percentage), so...
3. Add columns to the database (or even an entire database table) to show whether a user has had the New Year's greeting (or other holiday) email queued to be sent. One column can be `new_years_greeting_email_queued_to_be_sent_at` and another can be `new_years_greeting_email_sent_at`. The first field flags that the first worker has run; the second shows that the second worker has run. Add a method that fetches all users who have not yet been queued for this year's email (make sure to test this well). You can run it in advance for #1, and then again after you have sent out all the advance emails to catch anyone who joined after you did #1.
4. With #1 and #2 in mind, change the first worker so it just does a database query and populates ALL of the jobs for worker #2, again in advance.
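The queue-and-flag idea in the list above can be sketched in plain Ruby. This is only an in-memory illustration under stated assumptions: the `User` struct, the shortened `queued_at`/`sent_at` field names, the `queue_greetings` method, and the `enqueue` block are all hypothetical. In a real app the selection would be a database query and the block would hand each batch to Sidekiq (for example through its bulk enqueueing API).

```ruby
# Hypothetical stand-in for a users table row. `queued_at` and `sent_at`
# play the role of the two timestamp columns suggested above.
User = Struct.new(:id, :first_name, :email, :queued_at, :sent_at, keyword_init: true)

# Fan-out step: find users not yet queued, hand their data to `enqueue`
# in batches, and stamp them so a later run only picks up newcomers.
def queue_greetings(users, batch_size: 1_000, now: Time.now, &enqueue)
  pending = users.select { |u| u.queued_at.nil? }
  pending.each_slice(batch_size) do |batch|
    # Pass everything the sender needs so it never has to query the database.
    enqueue.call(batch.map { |u| [u.id, u.first_name, u.email] })
    batch.each { |u| u.queued_at = now }
  end
  pending.size
end

users = [
  User.new(id: 1, first_name: "Ada", email: "ada@example.com"),
  User.new(id: 2, first_name: "Lin", email: "lin@example.com"),
]
enqueued = []

# Advance run, done well before the emails need to go out.
first_run = queue_greetings(users, batch_size: 1) { |job_args| enqueued.concat(job_args) }

# A user who signs up after the advance run is caught by a second pass.
users << User.new(id: 3, first_name: "Sam", email: "sam@example.com")
second_run = queue_greetings(users) { |job_args| enqueued.concat(job_args) }
```

Because already-queued users are skipped, the same method can be run repeatedly without enqueueing anyone twice, which is what makes the catch-up pass safe.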
The bigger issue is that two workers may not be enough. A worker might process, what, 10K emails a second? At that rate, one billion emails take a bit more than a day (you need roughly 12K emails a second to finish within a day), and several billion take proportionally longer. If you meant several million, you should be able to finish in time: one million emails in a day needs only about 12 emails a second, and 100 or so emails a second, which should be doable without too much trouble assuming you aren't getting rate limited, covers up to about 8.6 million emails a day.
Though you'll definitely need to coordinate with whatever service you are using to send emails to make sure it won't be an issue.
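The throughput figures above can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the 10K-per-second rate is the answer's own guess rather than a measured number.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60 # 86_400

# Emails per second needed to send `total_emails` within the given window.
def required_rate(total_emails, seconds: SECONDS_PER_DAY)
  total_emails.fdiv(seconds)
end

# Hours needed to send `total_emails` at a sustained rate.
def hours_to_send(total_emails, emails_per_second)
  total_emails.fdiv(emails_per_second) / 3600
end

billion_rate  = required_rate(1_000_000_000).round           # 11574 per second
million_rate  = required_rate(1_000_000).round(1)            # 11.6 per second
billion_hours = hours_to_send(1_000_000_000, 10_000).round(1) # 27.8 hours
```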
Warning: Be careful with the above method. If you have billions of emails to send (or even millions), you may exceed the storage limit of Redis. In that case, you may need to skip #1 and just store IDs. If you are using Heroku or something similar, you may be able to spin up additional dynos to handle it instead of using workers (#1 and #2 can still be done in advance with just the data stored in the database; you can then do a `find_each` loop with a query to find which users still need their New Year's emails sent out).
Answer 2
Score: 1
Just have one worker fetch an email id (whatever that means) and then in that worker enqueue another worker to send the email (passing the email id).