Mongodb faster execution for $in for array for millions of records

Question

I have two MongoDB collections:
1). The first collection is sourcequeuemanualupload. It contains all records whose status is either Unworked or Assigned; records land here when data is uploaded. Example of an Assigned item record in the DB:

{
    "_id": ObjectId("63e0e46a6047d75b9c20d8ec"),
    "Properties: Name": "Hangman - Guess Words",
    "Appstore URL": "https://itunes.apple.com/app/id1375993101?hl=None",
    "Region": "na",
    "Create Date": "na",
    "AHT": "1",
    "sourceId": "63e0e3719b4f812ba5333a31",
    "type": "Manual",
    "uploadTime": "2023-02-06T11:28:42.533+0000",
    "status": "Assigned",
    "batchId": "63e0e3719b4f812ba5333a31_746f22e4319b4d81b8ab255f5e653c2c_612023112842"
}

2). The second collection is queuedata. It contains the data (questions and the responses the user gives on the tool) for items from the sourcequeuemanualupload collection that have been worked on; these are all saved in this collection with a status of Completed. The _id of the sourcequeuemanualupload record is stored here in the "id" field as the identifier. Example of a completed item:

{
    "_id": ObjectId("63e0e4b19b4f812ba5333a34"),
    "templateId": "63e0e28e9b4f812ba5333a30",
    "id": "63e0e46a6047d75b9c20d8ec",
    "moderator": "kodaga",
    "startTime": "2023-02-06T11:29:46.048Z",
    "endTime": "2023-02-06T11:29:52.438Z",
    "status": "Completed",
    "AHT": NumberLong(6),
    "userInput": [
            {
                    "question": "Is the URL leading to the desired store page link?",
                    "response": "yes"
            },
            {
                    "question": "Comments, if any.",
                    "response": "test 1"
            }
    ]
}

Initially the DBA made a mistake: he did not update the status of Assigned items that were completed to Completed in the sourcequeuemanualupload collection, so those items are still sitting as Assigned. We are working on a fix to mark all the Assigned items that have actually been worked on as Completed.

Current data volume:

> db.sourcequeuemanualupload.count()
414781
> db.sourcequeuemanualupload.count({"status": "Assigned"})
306418
> db.queuedata.count()
298128

The script I have written so far to identify the completed records and mark them as Completed:

var assigned_tasks_arr = [];

db.sourcequeuemanualupload.find({ status: "Assigned" }).forEach(function (rec) {
  assigned_tasks_arr.push(rec._id.str);
});

print(assigned_tasks_arr.length);

db.queuedata.count({ id: { $in: assigned_tasks_arr }, status: "Completed" }); // <------------- STEP 1

var completed_items = db.queuedata
  .find(
    { id: { $in: assigned_tasks_arr }, status: "Completed" },
    { _id: 0, id: 1 }
  )
  .toArray(); //    <------------- STEP 2

var completed_items_ids = [];

completed_items.forEach(function (rec) {
  completed_items_ids.push(rec.id);
});

var completed_items_unique_objectid = [];

completed_items_ids.forEach(function (id) {
  completed_items_unique_objectid.push(new ObjectId(id));
});

db.sourcequeuemanualupload.updateMany(
  { _id: { $in: completed_items_unique_objectid } },
  { $set: { status: "Completed" } }
); // <------------- STEP 3

Basically, I fetch the _ids of all items with status Assigned from the sourcequeuemanualupload collection and store them in an array. Next, to find out whether they have actually been completed and have a record in queuedata, I run a search using $in with assigned_tasks_arr to get the items that were actually completed. Once I have the list of completed ids, I want to go ahead and update their statuses in sourcequeuemanualupload.

However, STEP 1 and STEP 2 take more than 30 minutes to execute. It is probably more like an hour or longer (I could not get the exact time because my session was terminated). I believe a find with $in over an array runs in roughly N * log(M), where N is the length of the input array and M is the size of the collection. As the counts above show, my N and M are both in the hundreds of thousands, so this query takes a long time. How do I optimize this query, or write a faster one, so I can find the ids and update them all at once?

I need help making STEP 1, STEP 2 and STEP 3 in the script above run faster so that I don't have to wait for hours. Can we write a better join query here, or optimize this some other way?

Thanks.

I tried searching online for faster ways to optimize this. I added an index on 'id' in the queuedata collection, but it is still taking a long time.
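
For reference, this is roughly the kind of index and plan check involved here (a sketch only: the compound index shape and the explain call below are assumptions for illustration, not the exact commands I ran):

// Illustrative only: a compound index covering the equality on "status" and the $in on "id".
db.queuedata.createIndex({ status: 1, id: 1 });
db.sourcequeuemanualupload.createIndex({ status: 1 });

// Confirm the $in query actually uses the index (explain verbosity support may vary on DocumentDB).
db.queuedata
  .find({ id: { $in: assigned_tasks_arr }, status: "Completed" }, { _id: 0, id: 1 })
  .explain("executionStats");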

NOTE: I am using AWS DocumentDB 4.0.0, so if you are suggesting changes to the query, please check https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html, as some commands are not supported.

Answer 1

Score: 0

I would try it like this:

var completed_ids = []
db.sourcequeuemanualupload.aggregate([
   { $match: { status: "Assigned" } },
   {
      $lookup: {
         from: "queuedata",
         let: { id: "$_id" },
         pipeline: [
            { $match: { status: "Completed" } },
            { $match: { $expr: { $eq: ["$$id", { $toObjectId: "$id" }] } } }
         ],
         as: "queuedata"
      }
   },
   { $match: { queuedata: { $ne: [] } } },
   { $project: { _id: 1 } }
]).forEach(rec => {
   completed_ids.push(rec._id);
   if (completed_ids.length > 10000) {
      db.sourcequeuemanualupload.updateMany(
         { _id: { $in: completed_ids } },
         { $set: { status: "Completed" } }
      );
      completed_ids = [];
   }
})
if (completed_ids.length > 0) {
   db.sourcequeuemanualupload.updateMany(
      { _id: { $in: completed_ids } },
      { $set: { status: "Completed" } }
   );
}
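
As a quick sanity check after the run, the counts from the question can simply be repeated (nothing new here, just `db.collection.count()` again):

db.sourcequeuemanualupload.count({ status: "Assigned" })    // should drop well below 306418
db.sourcequeuemanualupload.count({ status: "Completed" })   // should roughly match the number of Completed queuedata records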

MongoDB version 4.0 is [End of Life][1]; you may want to look for a more modern MongoDB hosting service.

Another very simple approach would be this one:

db.queuedata.aggregate([
   { $match: { status: "Completed" } },
   {
      $project: {
         _id: { $toObjectId: "$id" },
         status: 1
      }
   },
   {
      $merge: {
         into: "sourcequeuemanualupload",
         whenMatched: "merge",
         whenNotMatched: "discard"
      }
   }
])

But this will update all matched documents in `sourcequeuemanualupload`, regardless of whether they currently have `status: 'Assigned'` or not.

A solution would be this one:

var completed_ids = [];
db.queuedata.aggregate([
   { $match: { status: "Completed" } },
   { $project: { _id: { $toObjectId: "$id" } } }
]).forEach(rec => {
   completed_ids.push(rec._id);
   if (completed_ids.length > 10000) {
      db.sourcequeuemanualupload.updateMany(
         {
            _id: { $in: completed_ids },
            status: "Assigned"
         },
         { $set: { status: "Completed" } }
      );
      completed_ids = [];
   }
})
if (completed_ids.length > 0) {
   db.sourcequeuemanualupload.updateMany(
      {
         _id: { $in: completed_ids },
         status: "Assigned"
      },
      { $set: { status: "Completed" } }
   );
}

When you go to fix your data, `id` should be converted to an `ObjectId` rather than stored as a plain string.
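
If that conversion is done as a backfill on the existing documents, a batched rewrite along these lines is one possible sketch (assuming every `id` is a valid 24-character hex string, and that `bulkWrite` and the `$type` query operator are available on your DocumentDB version; check the compatibility page linked in the question):

var ops = [];
// Sketch only: rewrite queuedata.id from a hex string to a real ObjectId, in batches.
db.queuedata.find({ id: { $type: "string" } }, { _id: 1, id: 1 }).forEach(function (doc) {
   ops.push({
      updateOne: {
         filter: { _id: doc._id },
         update: { $set: { id: new ObjectId(doc.id) } }
      }
   });
   if (ops.length >= 1000) {
      db.queuedata.bulkWrite(ops);
      ops = [];
   }
});
if (ops.length > 0) {
   db.queuedata.bulkWrite(ops);
}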

[1]: https://www.mongodb.com/support-policy/lifecycles


