Microsoft Cosmosdb for Mongodb: merge unsharded collection into sharded ones
Question
I have 2 collections of similar documents (i.e. same object, different values). One collection (X) is unsharded and lives in database A; the other collection (Y) is sharded and lives in database B. When I try to copy collection X into database B, I get an error saying "Shared throughput collection should have a partition key". I also tried copying the data with a foreach insert, but it takes too long.
So my question is, how can I append the data from collection X to collection Y in an efficient way?
The MongoDB version on CosmosDB is 3.4.6.
Answer 1
Score: 0
You may perform an aggregation and add the `$merge` operator as the last stage.

| `$merge` | `$out` |
| --- | --- |
| Can output to a sharded collection. | Cannot output to a sharded collection. |
| Input collection can also be sharded. | Input collection, however, can be sharded. |

https://docs.mongodb.com/manual/reference/operator/aggregation/merge/#comparison-with-out
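For reference, a minimal sketch of such a pipeline with the MongoDB Java driver (3.11+ API), assuming the database/collection names from the question (A/X and B/Y) and a placeholder connection string. Note that `$merge` itself needs a MongoDB 4.2+ server, so this would not run against the 3.4.6 endpoint in the question:

```java
import com.mongodb.MongoNamespace;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

import java.util.Collections;

public class MergeXIntoY {
    public static void main(String[] args) {
        // Placeholder connection string; point it at a deployment that supports $merge (4.2+).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> source = client.getDatabase("A").getCollection("X");

            // Pipeline whose last (and only) stage is $merge into database B, collection Y.
            // By default $merge inserts unmatched documents, i.e. it appends to Y.
            // toCollection() forces execution of a pipeline that ends with $merge/$out.
            source.aggregate(Collections.singletonList(
                    Aggregates.merge(new MongoNamespace("B", "Y"))
            )).toCollection();
        }
    }
}
```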
Answer 2
Score: 0
> So my question is, how can I append the data from collection X to collection Y in an efficient way?

The server tools `mongodump` and `mongorestore` can be used. You can export the source collection data into BSON dump files and import them into the target collection. These processes are very quick, because the data in the database is already in BSON format.

Data can be exported from a non-sharded collection and imported into a sharded collection using these tools. In this case, it is required that the source collection has the shard-key field (or fields) with values. Note that the indexes from the source collection are also exported and imported by these tools.

Here is an example of the scenario in question:

    mongodump --db=srcedb --collection=srcecoll --out="C:\mongo\dumps"

This creates a dump directory with the database name. There will be a "srcecoll.bson" file in it, which is used for importing.

    mongorestore --port 26xxxx --db=trgtdb --collection=trgtcoll --dir="C:\mongo\dumps\srcecoll.bson"

The host/port connects to the `mongos` of the sharded cluster. Note that the BSON file name needs to be specified in the `--dir` option.

The import adds data and indexes to the existing sharded collection. The process only inserts data; existing documents cannot be updated. If the `_id` value from the source collection already exists in the target collection, the process will not overwrite the document (those documents will simply not be imported, which is not an error).

There are some useful options for `mongorestore`, such as `--noIndexRestore` and `--dryRun`.
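For example, keeping the same placeholder port and dump path as above, the restore could be previewed with `--dryRun` before the real import (no data is written, only summary information is reported):

    mongorestore --port 26xxxx --db=trgtdb --collection=trgtcoll --dir="C:\mongo\dumps\srcecoll.bson" --dryRun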
Answer 3
Score: 0
Because the MongoDB version in CosmosDB is currently 3.4.6, it doesn't support `$merge` and a lot of other commands such as `collection.copyTo`, etc. Using Studio 3T's import feature didn't help either.
The solution I used is to download the target collection to my local MongoDB, clean it, then write Java code that reads my clean data from the local db and `insertMany` (or `bulkWrite`) it into the target collection. This way, the data is appended to the target collection.
The speed I measured was 2 hours for 1M documents (~750MB); of course, these numbers might vary depending on various factors, e.g. network, document size, etc.