Is there a better way to compare a huge dataset supplied by the user to the entries in the database?

Question

I have 5 million entries in a MySQL database with the following structure:

id (primary_key), name, phone_number
1, John Doe, 12346789

.....

My tool scrapes the internet continuously and captures millions of new entries. The scraped data is passed to the processDataInChunks function 10,000 entries at a time.

I am trying to compare the scraped data to the rows in the database and insert an entry if it doesn't already exist. To implement this I have used Sequelize in Node.js; here is the code:

// data = chunk of data collected by the scraper
// callback = function to call when a new entry is detected
function processDataInChunks(data, callback) {
   data.map(function (entry) { // loop through the data array; entry is the nth element
      db.findAll({ where: { phone_number: entry.phone_number } }) // db is the Sequelize model for the table
         .then(function (rows) { // called with the matching rows once the query succeeds
            if (rows.length === 0) { // the phone number is not in the database yet
               db.create({ // create the entry in the database
                  name: entry.name,
                  phone_number: entry.phone_number
               }).then(function () {
                  callback(entry);
                  console.log(`Found a new phone number: ${entry.phone_number}`)
               }).catch(err => console.log(err))
            }
         }).catch(err => console.log(err))
   })
}

While running the code I get a ConnectionAcquireTimeoutError. I assume this happens because all the connections in the pool are consumed and Sequelize has none left to run new queries. What is the best and fastest way to perform this operation? Please help.

I have tried using async/await, but it still takes ages and I don't think it will ever complete.
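For context, the pool that runs out is the one configured on the Sequelize instance. A rough sketch of those settings; the option names are Sequelize's, but the values below are only placeholders:

// Sketch of the Sequelize pool settings involved; all values here are placeholders.
// ConnectionAcquireTimeoutError is thrown when a query waits longer than
// `acquire` milliseconds for one of the `max` pooled connections.
const { Sequelize } = require('sequelize');

const sequelize = new Sequelize('database', 'user', 'password', {
   host: 'localhost',
   dialect: 'mysql',
   pool: {
      max: 5,         // maximum number of connections in the pool
      min: 0,
      acquire: 30000, // how long (ms) to wait for a connection before throwing
      idle: 10000
   }
});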

Answer 1

Score: 0

You seem to be issuing at least two SQL statements per incoming row. By really batching, you can get a 10x speedup.

Use INSERT INTO ... ON DUPLICATE KEY UPDATE ... (aka IODKU or upsert). It avoids having to do the initial SELECT. That is a 2x speedup.

Batch them in clumps of 1000 -- 10000 might be slightly faster, but may run into other issues.
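With Sequelize, one way to get both the batching and the IODKU in a single call is bulkCreate with updateOnDuplicate. A minimal sketch, assuming a model named Phone and a UNIQUE index on phone_number (which ON DUPLICATE KEY UPDATE relies on):

const BATCH_SIZE = 1000;

// Sketch: upsert one clump of scraped entries with a single
// INSERT ... ON DUPLICATE KEY UPDATE statement. `Phone` is a stand-in for
// your Sequelize model; IODKU needs a UNIQUE (or primary) key on
// phone_number to detect duplicates.
async function upsertBatch(batch) {
   await Phone.bulkCreate(
      batch.map(e => ({ name: e.name, phone_number: e.phone_number })),
      { updateOnDuplicate: ['name'] } // emits IODKU on MySQL
   );
}

// Feed the scraped data through in clumps of BATCH_SIZE.
async function processDataInChunks(data) {
   for (let i = 0; i < data.length; i += BATCH_SIZE) {
      await upsertBatch(data.slice(i, i + BATCH_SIZE));
   }
}

Note that this drops the per-new-entry callback from the question; if you need to know which numbers were new, you would have to look up the existing numbers for each batch first, or compare affected-row counts.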

Are most of the entries unchanged? Or is everything you get going to be either a new entry or an update? There may be some extra optimizations if the data tends to be one way versus the other. (IODKU is happy to handle all 3 cases.)

Does the data provide the id? Is it consistent? Or is the name the actual clue of which row to update? In that case, what index do you have? Can there be two different people with the same name? If so, how would you differentiate their rows?

You could feed the chunks to multiple threads. This would provide some parallelism. Stop at about the number of CPU cores. And do keep the chunk size down at 1000; 10000 is likely to have locking issues, maybe even deadlocks. And do check for errors.
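In a single Node process the cheap way to get that parallelism is to run several batches concurrently rather than using real threads. A minimal sketch, reusing the hypothetical upsertBatch helper from the sketch above; the worker count is an assumption:

const os = require('os');

// Sketch: process the chunks with a fixed number of concurrent "workers".
// `upsertBatch` is the batched IODKU helper sketched earlier.
async function processAllChunks(chunks, concurrency = os.cpus().length) {
   let next = 0;
   async function worker() {
      while (next < chunks.length) {
         const chunk = chunks[next++]; // claim the next unprocessed chunk
         await upsertBatch(chunk);     // one multi-row IODKU per chunk
      }
   }
   // start `concurrency` workers and wait for all of them to drain the queue
   await Promise.all(Array.from({ length: concurrency }, () => worker()));
}

Keep the concurrency at or below the pool's max setting, otherwise the same ConnectionAcquireTimeoutError can come back.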

It is possible to write a single IODKU that handles a thousand rows. Or you could throw all the data into a temp table and work from there.
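If you prefer to build that statement yourself instead of going through bulkCreate, a rough sketch of the raw multi-row form; the phones table name and the sequelize instance are assumptions:

// Sketch: build one multi-row INSERT ... ON DUPLICATE KEY UPDATE by hand
// and run it through sequelize.query with `?` replacements.
async function upsertBatchRaw(batch) {
   const placeholders = batch.map(() => '(?, ?)').join(', ');
   const values = batch.flatMap(e => [e.name, e.phone_number]);
   await sequelize.query(
      `INSERT INTO phones (name, phone_number)
       VALUES ${placeholders}
       ON DUPLICATE KEY UPDATE name = VALUES(name)`,
      { replacements: values }
   );
}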

If you receive only one row at a time, please spell out the details; an extra step will be needed to collect the data.

Summary --

  • 10x for batching statements
  • 2x for IODKU
  • 5x(?) for parallelism

Total might be 100x. Does that sound better?
