How to get the records created in the last 5 minutes from a table with 4 million records?

Question

I have a MySQL table, activity_logs, that stores every API call made by client apps since launch. The app has been running for a couple of years, and the table now has 4.5 million records.

It is worth noting that each record has an auto-increment ID, and the created_at field is never updated after the record is created, so the values of created_at increase in essentially the same order as the id field.

  • id is the primary key
  • created_at is not indexed

I noticed a performance issue today when I needed to query the records created in the last 5 minutes. I did something simple like where("created_at", ">=", now()->subMinutes(5)) in my PHP code, and soon noticed that this query scanned all 4.5 million records and was terribly slow. That makes sense, because the database engine doesn't know that created_at is always increasing in my case.

I paused before blindly adding an index on created_at to fix it: the values of created_at are so diverse that adding an index on it doesn't sound right. I ended up with two separate queries.

  • SELECT id FROM activity_logs WHERE created_at < DATE_SUB(NOW(), INTERVAL 5 MINUTE) ORDER BY id DESC LIMIT 1
  • SELECT * FROM activity_logs WHERE id > [Query Result]

It works well, but I am wondering what the standard way to deal with this is, as I believe it is a very common problem. I love PHP/Laravel and ORMs and haven't followed SQL topics closely, so I would like to hear your insights.

Answer 1

Score: 1

Add INDEX(created_at), then use

WHERE created_at > NOW() - INTERVAL 5 MINUTE
ORDER BY created_at, id
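
Spelled out as full statements, the suggestion above might look like the following sketch. The table and column names come from the question; the index name idx_created_at is an assumption chosen for illustration.

```sql
-- Add a secondary index on created_at (the index name is illustrative).
ALTER TABLE activity_logs
  ADD INDEX idx_created_at (created_at);

-- Fetch the rows created in the last 5 minutes, oldest first.
SELECT *
FROM activity_logs
WHERE created_at > NOW() - INTERVAL 5 MINUTE
ORDER BY created_at, id;

-- Check the plan: the range filter should be able to use idx_created_at
-- instead of scanning the whole table.
EXPLAIN
SELECT *
FROM activity_logs
WHERE created_at > NOW() - INTERVAL 5 MINUTE
ORDER BY created_at, id;
```

In the EXPLAIN output, a type of range with key set to the new index would indicate that only the recent slice of the index is being read rather than all 4.5 million rows.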

More details:

  • I assume (hope) that you have ENGINE=InnoDB on that table

  • If you have PRIMARY KEY(id), then the table is clustered on id, and MySQL cannot provide a second clustered index.

  • The secondary index suggested above has id silently tacked on the end. (That is also how it finds the row.)

  • The code I suggest will be fast.

  • MySQL has no CLUSTERED keyword. The PK is clustered, is unique, and is an index. (MySQL is 'lean and mean'; it does not have as many bells and whistles as other vendors because they aren't worth the effort to implement and maintain.)

  • A similar trick (but not really necessary here) is

      PRIMARY KEY(created_at, id),  -- To get the clustering you desire
      INDEX(id)    -- to keep AUTO_INCREMENT happy
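
As a sketch of how that alternative key layout could be applied to an existing table, assuming created_at is declared NOT NULL (primary key columns must be) and using a hypothetical index name idx_id. This rebuilds the table, so it is worth trying on a copy first:

```sql
-- Re-cluster the table on (created_at, id) in a single ALTER.
-- Assumes created_at is NOT NULL; idx_id is an illustrative name.
ALTER TABLE activity_logs
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (created_at, id),  -- rows are now stored in created_at order
  ADD INDEX idx_id (id);             -- AUTO_INCREMENT needs id to lead some index
```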
    

huangapple
  • Posted on July 27, 2023, 23:07:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76781115.html