Efficient way to iterate over large data sets in Go & MongoDB

Question
More of a general question than anything, but here is my problem. We have 2 databases, both MongoDB. We are in the process of migrating data from one Mongo instance to another, and the differences between v1 and v2 of our app meant refactoring that data. Herein lies the problem: one particular collection holds roughly 840 thousand records. Fetching those 840k records is not the problem; I need to convert the data structure to the new database schema. Part of that refactoring is embedding the newly generated ObjectIds of 3 other collections into this particular part of the migration.
For example:
patients collection,
caregivers collection,
agencies collection.
These are imported without IDs, which are then generated by Mongo. They are all referenced in the 840k records of the visits collection by their newly generated ObjectIds.
My problem arises when I range over all 4 collections to line up the data: I don't even know how long this process takes. I left it running for 4 hours before I killed it.
// O(agencies × caregivers × patients × visits) — the quadruple loop is the bottleneck.
for _, a := range agencies {
    for _, c := range caregivers {
        for _, p := range patients {
            for _, m := range model {
                // Match by value, not by pointer identity.
                if a.Name == *m.AgencyName && *m.CaregiverID == c.CareGiverID && *m.PatientID == p.PatientID {
                    visit := &models.Visit{
                        CreatedAt:  time.Now(),
                        UpdatedAt:  time.Now(),
                        AgencyID:   a.ID.Hex(),
                        ScheduleID: *m.ScheduleID,
                        Status:     *m.Status,
                        Start:      *m.Start,
                        End:        *m.End,
                        PatientID:  p.ID.Hex(),
                        UserID:     c.ID.Hex(),
                    }
                    updateModel = append(updateModel, mongo.NewUpdateOneModel().SetFilter(
                        bson.D{
                            {Key: "agencyId", Value: a.ID.Hex()},
                            {Key: "userId", Value: c.ID.Hex()},
                            {Key: "patientId", Value: p.ID.Hex()},
                            {Key: "scheduleId", Value: m.ScheduleID},
                            {Key: "start", Value: m.Start},
                            {Key: "end", Value: m.End},
                        },
                    ).SetUpdate(
                        bson.D{{Key: "$set", Value: visit}},
                    ).SetUpsert(true))
                }
            }
        }
    }
}
Now, that code worked for the smaller collections, but this one is taking longer than we have time to wait around for. My question is: what is the least time-consuming way of lining up 840k records with 32 agencies, 8,000 patients, and 6,000 caregivers? I did some math and it iterates more times than my calculator could spit out a number for. FYI, I don't care about resource consumption, as we can spin up a cloud machine of any size for the time it takes to run the app. I just want an efficient way of getting those ObjectIds from the 3 other collections for each of the 840k visits. I did have a thought of importing those collections into the old database and looking up the IDs during the aggregation for the input, but that is not very programmatic.
Modified range with cached ObjectIds
for _, m := range model {
    // c is the in-memory cache of previously resolved ObjectIds, keyed by
    // agency name and by "caregiver-<agencyId>-<id>" / "patient-<agencyId>-<id>".
    agency, agencyFound := c.Get(*m.AgencyName)
    uid, uidFound := c.Get(fmt.Sprintf("caregiver-%v-%d", agency, *m.CaregiverID))
    patientId, patientFound := c.Get(fmt.Sprintf("patient-%v-%d", agency, *m.PatientID))
    if agencyFound && uidFound && patientFound {
        visit := &models.Visit{
            CreatedAt:  time.Now(),
            UpdatedAt:  time.Now(),
            AgencyID:   fmt.Sprintf("%v", agency),
            ScheduleID: *m.ScheduleID,
            Status:     *m.Status,
            Start:      *m.Start,
            End:        *m.End,
            PatientID:  fmt.Sprintf("%v", patientId),
            UserID:     fmt.Sprintf("%v", uid),
        }
        updateModel = append(updateModel, mongo.NewUpdateOneModel().SetFilter(
            bson.D{
                {Key: "agencyId", Value: fmt.Sprintf("%v", agency)},
                {Key: "userId", Value: fmt.Sprintf("%v", uid)},
                {Key: "patientId", Value: fmt.Sprintf("%v", patientId)},
                {Key: "scheduleId", Value: m.ScheduleID},
                {Key: "start", Value: m.Start},
                {Key: "end", Value: m.End},
            },
        ).SetUpdate(
            bson.D{{Key: "$set", Value: visit}},
        ).SetUpsert(true))
    }
}
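Neither block above shows how updateModel is eventually flushed. A minimal sketch of executing it with the driver's BulkWrite in unordered batches, assuming updateModel is declared as []mongo.WriteModel and visitsColl is the destination collection (both assumptions, not from the original post):

// Assumes: updateModel is []mongo.WriteModel, ctx is a context.Context,
// visitsColl is a *mongo.Collection, and "log" plus
// "go.mongodb.org/mongo-driver/mongo/options" are imported.
const batchSize = 1000
for start := 0; start < len(updateModel); start += batchSize {
    end := start + batchSize
    if end > len(updateModel) {
        end = len(updateModel)
    }
    // Unordered bulk writes let the server process each batch in parallel.
    if _, err := visitsColl.BulkWrite(ctx, updateModel[start:end],
        options.BulkWrite().SetOrdered(false)); err != nil {
        log.Fatalf("bulk write failed: %v", err)
    }
}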
Answer 1
Score: 1
The Cartesian product of agencies, caregivers, patients and visits is 32 × 8,000 × 6,000 × 840,000, which is about 1.3 × 10^15.
Suppose the inner if takes one CPU cycle to evaluate (in reality it takes far more): on a 4 GHz CPU that is 1.3 × 10^15 / (4 × 10^9) ≈ 3.2 × 10^5 seconds, i.e. more than 89 hours to complete.
Generating all possible combinations of (agency, caregiver, patient) for each visit is a massive waste of resources.
Reverse approach: iterate over all visits and fetch the agency, caregiver and patient for that particular document. To avoid a database round trip per visit, store a mapping of the needed IDs in memory, as sketched below.
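A minimal sketch of that reverse approach in Go, replacing the nested loops with maps built in one pass per collection. The struct fields (a.Name, a.ID, cg.AgencyID, cg.CareGiverID, p.PatientID) and the buildVisitUpdate helper are assumptions based on the question's snippets, not a confirmed schema:

// Build lookup maps once; one pass over each of the three small collections.
// Key shapes mirror the cache keys in the question's modified range.
agencyIDByName := make(map[string]string, len(agencies))
for _, a := range agencies {
    agencyIDByName[a.Name] = a.ID.Hex()
}
caregiverIDByKey := make(map[string]string, len(caregivers))
for _, cg := range caregivers {
    // cg.AgencyID is an assumed field linking a caregiver to its agency.
    caregiverIDByKey[fmt.Sprintf("caregiver-%s-%d", cg.AgencyID.Hex(), cg.CareGiverID)] = cg.ID.Hex()
}
patientIDByKey := make(map[string]string, len(patients))
for _, p := range patients {
    patientIDByKey[fmt.Sprintf("patient-%s-%d", p.AgencyID.Hex(), p.PatientID)] = p.ID.Hex()
}

// One pass over the 840k visits: three O(1) lookups per record instead of
// scanning three collections, i.e. O(840k) rather than O(840k × 32 × 8k × 6k).
for _, m := range model {
    agencyID, ok := agencyIDByName[*m.AgencyName]
    if !ok {
        continue
    }
    uid, uidOK := caregiverIDByKey[fmt.Sprintf("caregiver-%s-%d", agencyID, *m.CaregiverID)]
    patientID, patOK := patientIDByKey[fmt.Sprintf("patient-%s-%d", agencyID, *m.PatientID)]
    if !uidOK || !patOK {
        continue
    }
    // buildVisitUpdate is a hypothetical helper wrapping the Visit /
    // UpdateOneModel construction shown in the question.
    updateModel = append(updateModel, buildVisitUpdate(agencyID, uid, patientID, m))
}

With 32 + 8,000 + 6,000 map entries the build step is negligible, and the main loop touches each of the 840k visits exactly once.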