Is there a better way to compare a huge dataset supplied by the user to the entries in the database?

Question

I have 5 million entries in a MySQL database with the following structure:

id (primary_key), name, phone_number
1, John Doe, 12346789

.....

My tool scrapes the internet continuously and captures millions of new entries. The scraped data is passed to the processDataInChunks function 10,000 entries at a time.

I am trying to compare the scraped data to the rows in the database and insert an entry if it doesn't already exist. To implement this I have used Sequelize in Node.js; here is the code:

// data = chunk of data collected by the scraper
// callback = function to call when a new entry is detected
function processDataInChunks(data, callback) {
   data.map(function (entry) { // loop through the data array; entry is the nth element
      db.findAll({ where: { phone_number: entry.phone_number } }) // db is the Sequelize model for the table
         .then(function (rows) { // called with the matching rows once the query succeeds
            if (rows.length === 0) { // the phone number is not in the database yet
               db.create({ // create the entry in the database
                  name: entry.name,
                  phone_number: entry.phone_number
               }).then(function () {
                  callback(entry);
                  console.log(`Found a new phone number: ${entry.phone_number}`)
               }).catch(err => console.log(err))
            }
         }).catch(err => console.log(err))
   })
}

While running the code I get a ConnectionAcquireTimeoutError. I assume this happens because all the connections in the pool are consumed and Sequelize has none left to run new queries. What is the best and fastest way to perform this operation? Please help.

I have tried using async/await, but it still takes ages and I don't think it will ever complete.
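For context, the pool that runs out is the one configured on the Sequelize instance. A rough sketch of those settings; the option names are Sequelize's, but the values below are only placeholders:

// Sketch of the Sequelize pool settings involved; all values here are placeholders.
// ConnectionAcquireTimeoutError is thrown when a query waits longer than
// `acquire` milliseconds for one of the `max` pooled connections.
const { Sequelize } = require('sequelize');

const sequelize = new Sequelize('database', 'user', 'password', {
   host: 'localhost',
   dialect: 'mysql',
   pool: {
      max: 5,         // maximum number of connections in the pool
      min: 0,
      acquire: 30000, // how long (ms) to wait for a connection before throwing
      idle: 10000
   }
});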

Answer 1

Score: 0

You seem to be issuing at least two SQL statements per incoming row. By really batching, you can get a 10x speedup.

Use INSERT INTO ... ON DUPLICATE KEY UPDATE ... (aka IODKU or upsert). It avoids having to do the initial SELECT. That is a 2x speedup.

Batch them in clumps of 1000 -- 10000 might be slightly faster, but may run into other issues.
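With Sequelize, one way to get both the batching and the IODKU in a single call is bulkCreate with updateOnDuplicate. A minimal sketch, assuming a model named Phone and a UNIQUE index on phone_number (which ON DUPLICATE KEY UPDATE relies on):

const BATCH_SIZE = 1000;

// Sketch: upsert one clump of scraped entries with a single
// INSERT ... ON DUPLICATE KEY UPDATE statement. `Phone` is a stand-in for
// your Sequelize model; IODKU needs a UNIQUE (or primary) key on
// phone_number to detect duplicates.
async function upsertBatch(batch) {
   await Phone.bulkCreate(
      batch.map(e => ({ name: e.name, phone_number: e.phone_number })),
      { updateOnDuplicate: ['name'] } // emits IODKU on MySQL
   );
}

// Feed the scraped data through in clumps of BATCH_SIZE.
async function processDataInChunks(data) {
   for (let i = 0; i < data.length; i += BATCH_SIZE) {
      await upsertBatch(data.slice(i, i + BATCH_SIZE));
   }
}

Note that this drops the per-new-entry callback from the question; if you need to know which numbers were new, you would have to look up the existing numbers for each batch first, or compare affected-row counts.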

Are most of the entries unchanged? Or is everything you get going to be either a new entry or an update? There may be some extra optimizations if the data tends to be one way versus the other. (IODKU is happy to handle all 3 cases.)

Does the data provide the id? Is it consistent? Or is the name the actual clue of which row to update? In that case, what index do you have? Can there be two different people with the same name? If so, how would you differentiate their rows?

You could feed the chunks to multiple threads. This would provide some parallelism. Stop at about the number of CPU cores. And do keep the chunk size down at 1000; 10000 is likely to have locking issues, maybe even deadlocks. And do check for errors.
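In a single Node process the cheap way to get that parallelism is to run several batches concurrently rather than using real threads. A minimal sketch, reusing the hypothetical upsertBatch helper from the sketch above; the worker count is an assumption:

const os = require('os');

// Sketch: process the chunks with a fixed number of concurrent "workers".
// `upsertBatch` is the batched IODKU helper sketched earlier.
async function processAllChunks(chunks, concurrency = os.cpus().length) {
   let next = 0;
   async function worker() {
      while (next < chunks.length) {
         const chunk = chunks[next++]; // claim the next unprocessed chunk
         await upsertBatch(chunk);     // one multi-row IODKU per chunk
      }
   }
   // start `concurrency` workers and wait for all of them to drain the queue
   await Promise.all(Array.from({ length: concurrency }, () => worker()));
}

Keep the concurrency at or below the pool's max setting, otherwise the same ConnectionAcquireTimeoutError can come back.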

It is possible to write a single IODKU that handles a thousand rows. Or you could throw all the data into a temp table and work from there.
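If you prefer to build that statement yourself instead of going through bulkCreate, a rough sketch of the raw multi-row form; the phones table name and the sequelize instance are assumptions:

// Sketch: build one multi-row INSERT ... ON DUPLICATE KEY UPDATE by hand
// and run it through sequelize.query with `?` replacements.
async function upsertBatchRaw(batch) {
   const placeholders = batch.map(() => '(?, ?)').join(', ');
   const values = batch.flatMap(e => [e.name, e.phone_number]);
   await sequelize.query(
      `INSERT INTO phones (name, phone_number)
       VALUES ${placeholders}
       ON DUPLICATE KEY UPDATE name = VALUES(name)`,
      { replacements: values }
   );
}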

If you receive only one row at a time, please spell out the details; an extra step will be needed to collect the data.

Summary --

  • 10x for batching statements
  • 2x for IODKU
  • 5x(?) for parallelism

Total might be 100x. Does that sound better?
