Efficient indexing of an emails collection for ordering & filtering by email domain
Question
I'm using Mongoose to hold a central collection of email addresses, and I also have collections for Users and Organisations. In my app I associate Users with Organisations through their (verified) email domains. E.g. Acme Ltd owns the domains acme.com and acme.co.uk, and by selecting from all emails using those domains, I can collate a unique list of associated users.
Users can have many email addresses (1 primary + numerous secondary emails). Users can't share email addresses (hence the "verifiedBy" field which enforces a one-to-one relationship between Users and Emails).
My schema is (currently) as follows:
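const { Schema } = require('mongoose');

const emailSchema = new Schema({
  _id: {
    type: String,
    // Getter: turn the stored "domain@local" back into "local@domain".
    get: function idReverse(_id) { if (_id) return _id.split("@").reverse().join("@"); },
    // Setter: normalise, then store as "domain@local". It must return the value.
    set: (str) => str.trim().toLowerCase().split("@").reverse().join("@")
  },
  verifiedBy: { type: String, ref: 'User' }
}, options);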
> My question is whether it is worth reversing the domain parts of the
> email address in the setter, and unreversing them in the getter - as
> I've shown - in order that the underlying MongoDB index on _id can improve
> performance & make it easier to deal with the kinds of lookups I've
> described?
The alternatives I've already considered are:
- Storing the email as is and using regex to select users by domain part (feels expensive to me processing-wise)
- Storing the domain part in a separate field and indexing that (feels expensive as there'd be two indexes, and duplicated data storage)
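To make the reversed-key idea concrete, here is a minimal sketch of the round trip outside of Mongoose. The helper names `toStoredKey` and `fromStoredKey` and the sample addresses are assumptions for illustration only; they are not part of the schema above. It shows that, once the domain comes first, a plain sort (which is how a string index is ordered under the default binary comparison) places all addresses of one domain next to each other.

```javascript
// Hypothetical helpers; names are illustrative, not from the original schema.

// "Local@Example.com " -> "example.com@local" (the form that would be stored)
function toStoredKey(email) {
  return email.trim().toLowerCase().split("@").reverse().join("@");
}

// "example.com@local" -> "local@example.com" (the form shown to the app)
function fromStoredKey(key) {
  return key.split("@").reverse().join("@");
}

// Made-up sample addresses, normalised and sorted by code unit order.
const keys = ["bob@acme.com", "zed@acme.co.uk", "amy@other.org", "Al@Acme.com "]
  .map(toStoredKey)
  .sort();

// Each domain now forms one contiguous run of keys.
console.log(keys);
console.log(keys.map(fromStoredKey));
```

With this layout, "all emails at acme.com" becomes "all keys that start with `acme.com@`", which is the kind of lookup an ordered index can answer with a range scan.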
Answer 1
Score: 1
The first option should actually work pretty well. According to the $regex docs:
> [...] Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string. [...]
>
> A regular expression is a “prefix expression” if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. [...]
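A prefix expression only narrows the search to one domain if the domain sits at the start of the stored string, as it does with the reversed `_id` from the question. The following is a minimal sketch of building such an anchored pattern from a domain name. The helper names `escapeRegExp` and `domainPrefixRegex` are hypothetical, and whether the server still treats a pattern containing an escaped dot as a prefix expression is an assumption worth confirming with explain() on your own deployment.

```javascript
// Hypothetical helpers; names are illustrative, not part of any library.

// Escape regex metacharacters so that "acme.com" matches a literal dot.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build an anchored pattern for keys stored as "domain@local".
// The trailing "@" stops "acme.com" from also matching "acme.com.au@...".
function domainPrefixRegex(domain) {
  return new RegExp("^" + escapeRegExp(domain.trim().toLowerCase()) + "@");
}

const pattern = domainPrefixRegex("acme.com");

// The pattern would then go into a filter such as { _id: pattern }.
console.log(pattern.source);
console.log(pattern.test("acme.com@bob"), pattern.test("bob@acme.com"));
```

Anchoring with `^` and ending on the `@` separator keeps the match tied to exactly one domain rather than to any string that merely contains it.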
Experiment
Let's check how it works on a collection with ~800k docs, ~25% of which have an email. The example query analyzed is {email: /^gmail/}.
Without an index:
db.users.find({email: /^gmail/}).explain('executionStats').executionStats
// ...
// "nReturned" : 2208,
// "executionTimeMillis" : 250,
// "totalKeysExamined" : 0,
// "totalDocsExamined" : 202720,
// ...
With an {email: 1} index:
db.users.find({email: /^gmail/}).explain('executionStats').executionStats
// ...
// "nReturned" : 2208,
// "executionTimeMillis" : 5,
// "totalKeysExamined" : 2209,
// "totalDocsExamined" : 2208,
// ...
As we can see, the index definitely helps, both in execution time and in examined docs (more examined docs possibly means more IO work). Let's see what happens if we drop the ^ anchor and use the unanchored query {email: /gmail/}.
Without an index:
db.users.find({email: /gmail/}).explain('executionStats').executionStats
// ...
// "nReturned" : 2217,
// "executionTimeMillis" : 327,
// "totalKeysExamined" : 0,
// "totalDocsExamined" : 202720,
// ...
With an {email: 1} index:
db.users.find({email: /gmail/}).explain('executionStats').executionStats
// ...
// "nReturned" : 2217,
// "executionTimeMillis" : 210,
// "totalKeysExamined" : 200616,
// "totalDocsExamined" : 2217,
// ...
In the end, the index helps a lot, especially for the prefixed query: 5 ms and 2209 keys examined, versus 210 ms and 200616 keys examined for the unanchored one, which still has to walk almost the whole index. The prefixed query looks fast enough to keep everything as it is, in a single field. A separate field may use the index even better (play with it!), but 5 ms is good enough, I think.
As always, I'd strongly encourage you to run these tests on your own data, as its characteristics may affect performance.
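When repeating the experiment on your own data, it can help to reduce each explain() result to a couple of ratios. Below is a minimal sketch that does this for the indexed runs quoted above. The helper name `scanSummary` is hypothetical, and it assumes only the three executionStats fields shown in the output (nReturned, totalKeysExamined, totalDocsExamined).

```javascript
// Hypothetical helper: summarise the executionStats fields quoted above.
function scanSummary(stats) {
  const returned = stats.nReturned;
  return {
    // Index keys inspected per returned document; close to 1 is ideal.
    keysPerResult: returned ? stats.totalKeysExamined / returned : null,
    // Documents fetched per returned document; close to 1 is ideal.
    docsPerResult: returned ? stats.totalDocsExamined / returned : null,
  };
}

// Numbers copied from the two indexed runs in the experiment.
const anchored = scanSummary({ nReturned: 2208, totalKeysExamined: 2209, totalDocsExamined: 2208 });
const unanchored = scanSummary({ nReturned: 2217, totalKeysExamined: 200616, totalDocsExamined: 2217 });

console.log(anchored, unanchored);
```

A keys-per-result ratio near 1 means the index bounds were tight, as in the anchored run. A ratio near 90, as in the unanchored run, means most of the index was scanned even though few documents were fetched.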