2023年6月13日 01:10:19go评论74阅读模式

英文:

Mongo db query optimization

问题

我正在使用MongoDB来存储通知，每天午夜都会运行一个cron作业来选择旧的通知并删除它们。

删除策略有一个逻辑，以保留每个组合的最旧通知 "sourceObjectIdentifier" 和 "notificationType"。查询1的聚合输出提供给查询2以形成删除查询。

查询1 -

db.persistentEvent.aggregate([
  {
    &quot;$match&quot;: {
      &quot;notificationClass&quot;: {&quot;$eq&quot;: &quot;category&quot;}
    }
  },
  {
    &quot;$group&quot;: {
      &quot;_id&quot;: [
        {&quot;sourceObjectIdentifier&quot;: &quot;$sourceObjectIdentifier&quot;},
        {&quot;notificationType&quot;: &quot;$notificationType&quot;}
      ],
      &quot;notificationId&quot;: {&quot;$last&quot;: &quot;$notificationId&quot;},
      &quot;notificationClass&quot;: {&quot;$last&quot;: &quot;$notificationClass&quot;}
    }
  },
  {
    &quot;$project&quot;: {
      &quot;_id&quot;: 0,
      &quot;notificationId&quot;: 1,
      &quot;notificationClass&quot;: 1
    }
  },
  {
    &quot;$sort&quot;: {
      &quot;sourceObjectIdentifier&quot;: 1,
      &quot;notificationType&quot;: 1,
      &quot;creationTime&quot;: -1
    }
  }
])

查询2 -

{
  &quot;deliveryTime&quot;: {
    &quot;$gt&quot;: &quot;$datefrom&quot;
  },
  &quot;creationTime&quot;: {
    &quot;$lt&quot;: &quot;dateto&quot;
  },
  &quot;notificationId&quot;: {
    &quot;$nin&quot;: [query_1_output]
  }
}

如果集合中的通知数量较少，此删除逻辑可正常运行，但如果通知数量在200k或更多，则开始花费很长时间。

有人可以建议如何改进查询，查询1和2是否可以一起执行？

英文:

I am using mongo db to store notifications, a cron job runs every midnight to pick old notification and delete them.

The deletion policy has a logic to retain the oldest notification of each combination of "sourceObjectIdentifier" and "notificationType". The aggregated output of query 1 is given to query 2 to form the deletion query.

query 1 -

db.persistentEvent.aggregate([
  {
    &quot;$match&quot;: {
      &quot;notificationClass&quot;: {&quot;$eq&quot;: &quot;category&quot;}
    }
  },
  {
    &quot;$group&quot;: {
      &quot;_id&quot;: [
        {&quot;sourceObjectIdentifier&quot;: &quot;$sourceObjectIdentifier&quot;},
        {&quot;notificationType&quot;: &quot;$notificationType&quot;}
      ],
      &quot;notificationId&quot;: {&quot;$last&quot;: &quot;$notificationId&quot;},
      &quot;notificationClass&quot;: {&quot;$last&quot;: &quot;$notificationClass&quot;}
    }
  },
  {
    &quot;$project&quot;: {
      &quot;_id&quot;: 0,
      &quot;notificationId&quot;: 1,
      &quot;notificationClass&quot;: 1
    }
  },
  {
    &quot;$sort&quot;: {
      &quot;sourceObjectIdentifier&quot;: 1,
      &quot;notificationType&quot;: 1,
      &quot;creationTime&quot;: -1
    }
  }
])

query 2 -

{
  &quot;deliveryTime&quot;: {
    &quot;$gt&quot;: &quot;$datefrom&quot;
  },
  &quot;creationTime&quot;: {
    &quot;$lt&quot;: &quot;dateto&quot;
  },
  &quot;notificationId&quot;: {
    &quot;$nin&quot;: [query_1_output]
  }
}

This deletion logic works fine if the number of notifications in the collection is less but it starts to take a long time when the number of notifications are in range 200k or above.

Can someone please suggest how can I improve the query and can the query 1 and 2 be executed together?

答案1

得分: 2

以下是翻译好的部分：

你如何定义“最新通知”并不清楚。
是最近插入的通知吗？
是具有最大的notificationId的通知吗？
是具有最新的creationTime的通知吗？

首先确保在notificationClass上有一个索引。一个由{sourceObjectIdentifier: 1, notificationType: 1}组成的复合索引也可能有所帮助。

使用$setWindowFields来查找最新的通知（排名 = 1），并删除其他所有通知。下面的命令将“最新通知”视为具有最新creationTime的通知。

var deleteDocs = [];
db.persistentEvent.aggregate([
   { $match: { notificationClass: "category" } },
   {
      $setWindowFields: {
         partitionBy: {
            sourceObjectIdentifier: "sourceObjectIdentifier",
            notificationType: "$notificationType"
         },
         sortBy: { creationTime: -1 },
         output: {
            rank: { $rank: {} },
         }
      }
   },
   { $match: { rank: { $gt: 1 } } },
   { $project: { _id: 1 } }
]).toArray().forEach(doc => {
   deleteDocs.push(doc._id);
   if (deleteDocs.length > 10000) {
      db.persistentEvent.deleteMany({ _id: { $in: deleteDocs } });
      deleteDocs = [];
   }
})
if (deleteDocs.length > 0)
   db.persistentEvent.deleteMany({ _id: { $in: deleteDocs } });

在Mongo Playground中查看

英文:

It is not clear how you define "newest notification".

The notification which was inserted most recently?
The notification with the greatest notificationId?
The notification with the newest creationTime?

First ensure that you have an index on the notificationClass. A composed index on {sourceObjectIdentifier: 1, notificationType: 1} may also help.

Use $setWindowFields to find the newest notifications (rank = 1), and delete all others. Below command considers "newest notification" as notification with the newest creationTime.

var deleteDocs = [];
db.persistentEvent.aggregate([
   { $match: { notificationClass: &quot;category&quot; } },
   {
      $setWindowFields: {
         partitionBy: {
            sourceObjectIdentifier: &quot;sourceObjectIdentifier&quot;,
            notificationType: &quot;$notificationType&quot;
         },
         sortBy: { creationTime: -1 },
         output: {
            rank: { $rank: {} },
         }
      }
   },
   { $match: { rank: { $gt: 1 } } },
   { $project: { _id: 1 } }
]).toArray().forEach(doc =&gt; {
   deleteDocs.push(doc._id);
   if (deleteDocs.length &gt; 10000) {
      db.persistentEvent.deleteMany({ _id: { $in: deleteDocs } });
      deleteDocs = [];
   }
})
if (deleteDocs.length &gt; 0)
   db.persistentEvent.deleteMany({ _id: { $in: deleteDocs } });

Mongo Playground

答案2

得分: 0

如所述，查询1将扫描比实际必要更多的文档。
此外，当遇到排序阶段时，分组阶段和项目阶段将删除除 notificationId 和 notificationClass 之外的所有字段。这意味着排序对象中的字段将不存在，因此排序将不起作用。

为了改进此查询：

将排序阶段移到分组阶段之前
创建一个包括匹配字段和排序字段的索引
如果此查询中提到的5个字段不构成每个文档大小的绝大部分，还可以将 notificationId 添加到索引中。

如果查询执行器能够根据索引选择和排序文档，这将消除加载存储中的非匹配文档的需求，并避免内存中的排序。这两者都将有助于性能。

如果文档明显大于 notificationClass、notificationId、sourceObjectIdentifier、notificationType 和 creationTime 的组合，那么创建包含这些字段的索引，遵循ESR规则，将允许查询1仅使用索引中包含的信息来满足，使其成为一个覆盖查询：

{
   "notificationClass": 1,
   "sourceObjectIdentifier": 1,
   "notificationType": 1,
   "creationTime": -1,
   "notificationId": 1
}

使用上述索引，随着集合的增长，此流水线应该比现有查询性能显着提高：

[
  {
    "$match": {
      "notificationClass": {"$eq": "category"}
    }
  },
  {
    "$sort": {
      "sourceObjectIdentifier": 1,
      "notificationType": 1,
      "creationTime": -1
    }
  },
  {
    "$group": {
      "_id": [
        {"sourceObjectIdentifier": "$sourceObjectIdentifier"},
        {"notificationType": "$notificationType"}
      ],
      "notificationId": {"$last": "$notificationId"},
      "notificationClass": {"$last": "$notificationClass"}
    }
  },
  {
    "$project": {
      "_id": 0,
      "notificationId": 1,
      "notificationClass": 1
    }
  }
]

英文:

As written, query 1 will scan far more documents than is really necessary.
Also, by the time the sort stage is encountered, the group stage and project stage will have removed all fields except notificationId and notificationClass. This means none of the fields in the sort object will exist, so that sort will accomplish nothing.

To improve this query:

move the sort stage to before the group stage
create an index that includes the match field and the sort fields
if the 5 fields mention in this query do not constitute the vast majority of the size of each document, also add notificationId to the index.

If the query executor is able to select and sort the documents based on the index, that will eliminate the need to load non-matching documents from storage, and will avoid an in-memory sort. Both of these will help performance.

If the documents are significantly larger than the combination of notificationClass, notificationId, sourceObjectIdentifier, notificationType, and creationTime, creating an index containing those fields, following the ESR rule will permit query 1 to be satisfied using only the information included in the index, making it a covered query:

{
   notificationClass: 1,
   sourceObjectIdentifier: 1,
   notificationType: 1,
   creationTime: -1,
   notificationId: 1
}

With the above index this pipeline should perform significantly better than the existing query as the collection grows:

[
  {
    &quot;$match&quot;: {
      &quot;notificationClass&quot;: {&quot;$eq&quot;: &quot;category&quot;}
    }
  },
  {
    &quot;$sort&quot;: {
      &quot;sourceObjectIdentifier&quot;: 1,
      &quot;notificationType&quot;: 1,
      &quot;creationTime&quot;: -1
    }
  },
  {
    &quot;$group&quot;: {
      &quot;_id&quot;: [
        {&quot;sourceObjectIdentifier&quot;: &quot;$sourceObjectIdentifier&quot;},
        {&quot;notificationType&quot;: &quot;$notificationType&quot;}
      ],
      &quot;notificationId&quot;: {&quot;$last&quot;: &quot;$notificationId&quot;},
      &quot;notificationClass&quot;: {&quot;$last&quot;: &quot;$notificationClass&quot;}
    }
  },
  {
    &quot;$project&quot;: {
      &quot;_id&quot;: 0,
      &quot;notificationId&quot;: 1,
      &quot;notificationClass&quot;: 1
    }
  }
]

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

MongoDB查询优化

问题

答案1

答案2

使用mongo.Connect时连接未打开，但在执行查询时会打开连接。

创建一个日期数组的有效方法？

MongoDB 更新集合文档

将聚合中的 `_id`（ObjectId）转换为字符串以在Spring Boot的lookup中使用。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论