How to reduce parallel insert time in MongoDB?
Question
I have a VM hosted in Azure, a Standard D2ds v5 (2 CPU cores, 8 GB of RAM).
I created a console application that reads data from a CSV file and inserts it into a collection.
The CSV file has around 100k rows with 30 columns (about 14 MB).
The code used to insert the data is shown below:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using CsvHelper;
using CsvHelper.Configuration;
using MongoDB.Bson;
using MongoDB.Driver;

public static void ConnectAndInsert(string path, string myQueueItem)
{
    string connectionString = "mongodb://localhost:27017/";
    MongoClient client = new MongoClient(connectionString);
    IMongoDatabase database = client.GetDatabase("csa");
    // create the collection for this queue item if it does not already exist
    // (CollectionExists is a helper defined elsewhere in the application)
    if (!CollectionExists(database, myQueueItem))
    {
        var options = new CreateCollectionOptions()
        {
        };
        database.CreateCollection(myQueueItem, options);
    }
    IMongoCollection<BsonDocument> collection = database.GetCollection<BsonDocument>(myQueueItem);
    // stopwatch to measure the total import time
    Stopwatch stopwatch = new Stopwatch();
    Console.WriteLine("Inserting data...");
    stopwatch.Start();
    using (var reader = new StreamReader(path))
    using (var csvReader = new CsvReader(reader, new CsvConfiguration(CultureInfo.InvariantCulture) { Delimiter = ";" }))
    {
        csvReader.Read();
        csvReader.ReadHeader();
        int offset = 0;
        var data = new List<BsonDocument>();
        var option = new InsertManyOptions() { IsOrdered = false };
        while (csvReader.Read())
        {
            var document = new BsonDocument();
            foreach (var header in csvReader.HeaderRecord)
            {
                document.Add(header, csvReader.GetField(header));
            }
            // buffer the current row first so no document is skipped when a batch is flushed
            data.Add(document);
            offset++;
            if (offset >= 20000)
            {
                Stopwatch insertStopwatch = new Stopwatch();
                insertStopwatch.Start();
                collection.InsertMany(data, option);
                insertStopwatch.Stop();
                Console.WriteLine("inserted time : " + insertStopwatch.ElapsedMilliseconds + " ms");
                offset = 0;
                data = new List<BsonDocument>();
            }
        }
        if (offset > 0)
        {
            collection.InsertMany(data, option);
        }
    }
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds + " ms");
}
The method is triggered by queue messages.
I drop 100 messages at a time and run application (1) to import the data; it takes about 3000 ms to insert the data into MongoDB, processing the messages one by one.
Then I run application (2) to import data as well. With both applications importing, each insert into MongoDB takes about 5000 ms.
There are around 1000 collections holding the same kind of data.
Information about the MongoDB deployment:
MongoDB 6.0.6 Community, deployed on a single machine without sharding.
{
name: 'wiredTiger',
supportsCommittedReads: true,
oldestRequiredTimestampForCrashRecovery: Timestamp({ t: 0, i: 0 }),
supportsPendingDrops: true,
dropPendingIdents: Long("0"),
supportsSnapshotReadConcern: true,
readOnly: false,
persistent: true,
backupCursorOpen: false
}
My question is: how can I reduce the insert time when the imports run in parallel? It looks like application (1) and application (2) wait for each other to finish before inserting, so only one insert command can run at a time, and the time doubles in my case.
Following a comment, I edited the code to time only the insert call: with 1 process an insert takes about 60 ms, but with 2 processes it takes about 100 ms.
Answer 1
Score: 0
MongoDB only has one writer per collection; this constraint is what gives you atomicity in your operations, but it also limits how much you can parallelize your inserts.
No matter what you do in your application this constraint remains true: a single writer cursor needs to write the data into the collection. This is also why parallelizing your code does not necessarily make it faster.
> My question is: how can I reduce the insert time when the imports run in parallel? It looks like application (1) and application (2) wait for each other to finish before inserting.
Now that we understand this constraint, let's look at what we can do to reduce the processing time.
- Even with the single-writer constraint, parallelizing the processes might still be faster, for example if a lot of in-memory work is happening. Yes, each individual insert will get slower, but overall the two processes might finish sooner. Optimizing this is purely trial and error to find the sweet spot for the number of concurrent processes to run (a bounded-parallelism sketch follows this list).
- Dropping redundant indexes: indexes are maintained on insertion, so they slow the operation down quite a bit, especially as scale increases and/or if the indexes are complex (multikey, geo, etc.). If you have any redundant indexes you don't really use, I recommend dropping them, as this will speed up the inserts (see the index sketch after this list).
- I believe you mentioned you're running this DB as a standalone, but if the DB is replicated then write concern may also be an issue; with high-volume inserts, replication lag can occur. Depending on your requirements you can drop the "majority" requirement, but this should not be done recklessly, as it can come at a data-integrity cost (see the write-concern sketch after this list).
- Sharding the collection: if you shard the collection (either an application-level shard, i.e. multiple smaller collections, or actual Mongo sharding), inserts will also be much more efficient. This mainly matters at high scale, for reasons similar to the index point above, since the index sizes will be much smaller (see the last sketch after this list).
- Finally, hardware improvements: many people overlook this aspect nowadays, but hardware limitations are real, and adding some disk IOPS could also help speed things along. I would treat this as a "last resort" kind of action, since disk IOPS are generally quite expensive.
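A minimal sketch of the first point, assuming the CSV rows have already been parsed into batches of BsonDocument; the batches parameter and the maxConcurrency value are placeholders to tune by trial and error, not values taken from the question:

// Sketch: issue unordered InsertMany batches with a bounded degree of parallelism.
// `batches` and `maxConcurrency` are assumptions to be tuned, not values from the question.
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class BatchInserter
{
    public static async Task InsertBatchesAsync(
        IMongoCollection<BsonDocument> collection,
        IEnumerable<List<BsonDocument>> batches,
        int maxConcurrency)
    {
        var options = new InsertManyOptions { IsOrdered = false };
        using var gate = new SemaphoreSlim(maxConcurrency);
        var tasks = new List<Task>();

        foreach (var batch in batches)
        {
            await gate.WaitAsync();                 // throttle: at most maxConcurrency inserts in flight
            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    await collection.InsertManyAsync(batch, options);   // one unordered bulk insert per batch
                }
                finally
                {
                    gate.Release();
                }
            }));
        }

        await Task.WhenAll(tasks);                  // wait for the remaining batches to finish
    }
}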
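For the index point, a small sketch using the .NET driver's index manager to inspect and drop an index before the import; the collection name and the "field_1" index name are hypothetical:

// Sketch: list the indexes on the target collection and drop one that is not
// needed during the bulk import (recreate it afterwards). The collection name
// and the "field_1" index name are hypothetical.
using System;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017/");
var collection = client.GetDatabase("csa").GetCollection<BsonDocument>("myQueueItem");

// print every index; the default _id index cannot be dropped
foreach (var index in collection.Indexes.List().ToList())
{
    Console.WriteLine(index["name"].AsString);
}

collection.Indexes.DropOne("field_1");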
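For the write-concern point, a sketch of relaxing the write concern on a per-collection handle; this only matters if the deployment is a replica set rather than the standalone described in the question:

// Sketch: acknowledge writes from the primary only instead of waiting for a
// majority. Only relevant on a replica set; weigh it against durability needs.
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017/");
var collection = client.GetDatabase("csa").GetCollection<BsonDocument>("myQueueItem");

var fastWrites = collection.WithWriteConcern(WriteConcern.W1);
var docs = new List<BsonDocument> { new BsonDocument("example", 1) };   // placeholder data
fastWrites.InsertMany(docs, new InsertManyOptions { IsOrdered = false });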
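And a sketch of the application-level sharding idea: spreading rows across several smaller collections chosen from a hash of a column value. The shard count and the "Id" column are assumptions for illustration:

// Sketch: application-level "sharding": route each row to one of several
// smaller collections based on a hash of a column. shardCount and the "Id"
// column are illustrative; use a stable hash if the mapping must survive restarts.
using MongoDB.Bson;
using MongoDB.Driver;

int shardCount = 4;
var database = new MongoClient("mongodb://localhost:27017/").GetDatabase("csa");

var document = new BsonDocument { { "Id", "row-123" }, { "Value", "example" } };   // placeholder row
int shard = (document["Id"].AsString.GetHashCode() & 0x7fffffff) % shardCount;
var shardCollection = database.GetCollection<BsonDocument>($"myQueueItem_{shard}");
shardCollection.InsertOne(document);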
Comments