How to partition a table by year and then subpartition by month in mysql 8
Question
I have a table that contains a month and a year column.
My queries usually look something like WHERE month=1 AND year=2022.
Given how large this table is, I would like to make it more efficient using partitions and subpartitions.
Table 1
Querying the data I need took around 2 minutes and 30 seconds.
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_month_year` (`month`,`year`, `entity_type`)
)
Partitioning by "month"
Querying the data I need took around 21 seconds (a big improvement).
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`,`month`),
KEY `idx_month_year` (`month`,`year`, `entity_type`)
) ENGINE=InnoDB AUTO_INCREMENT=21000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY LIST (`month`)
(PARTITION p0 VALUES IN (0) ENGINE = InnoDB,
PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
PARTITION p5 VALUES IN (5) ENGINE = InnoDB,
PARTITION p6 VALUES IN (6) ENGINE = InnoDB,
PARTITION p7 VALUES IN (7) ENGINE = InnoDB,
PARTITION p8 VALUES IN (8) ENGINE = InnoDB,
PARTITION p9 VALUES IN (9) ENGINE = InnoDB,
PARTITION p10 VALUES IN (10) ENGINE = InnoDB,
PARTITION p11 VALUES IN (11) ENGINE = InnoDB,
PARTITION p12 VALUES IN (12) ENGINE = InnoDB) */
I would like to see if I can improve the performance even further by partitioning by year and then subpartitioning by month. How can I do that?
I'm not sure whether the following question is relevant, since it has no marked answer and looks to be specific to MySQL 5.* and PHP: https://stackoverflow.com/questions/23802644/partition-by-year-and-sub-partition-by-month-mysql
I'm asking about MySQL 8. Have there been any changes since then regarding partitioning, subpartitioning, LIST COLUMNS, RANGE COLUMNS, etc. that could help me?
The broader query I'm making:
SELECT
table_1.entity_id AS entity_id,
table_1.entity_type,
table_1.score
FROM table_1
WHERE table_1.month = 12 AND table_1.year = 2022
AND table_1.score > 0
AND table_1.entity_type IN ('type1', 'type2', 'type3', 'type4') # only ever 4 types; usually all 4 are present in the query
Answer 1
Score: 2
- Splitting a date into columns is usually counterproductive. It is much easier to split during SELECT.
- PARTITIONing is usually useless for performance of any SELECT.
- When partitioning (or unpartitioning), the indexes usually need changing.
For that query, I recommend a combined date column,
WHERE date >= '2022-01-01'
AND date < '2022-01-01' + INTERVAL 1 MONTH
and some INDEX starting with date.
(You probably have other queries; let's see some of them; they may need a different index.)
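A minimal sketch of that suggestion, assuming a hypothetical DATE column named score_date takes the place of the separate month and year columns. The table name, column name, and index name are illustrative and do not come from the original schema:

```sql
-- Hypothetical unpartitioned layout with one DATE column instead of month/year.
CREATE TABLE `table_1_by_date` (
  `id` int NOT NULL AUTO_INCREMENT,
  `entity_id` varchar(36) NOT NULL,
  `entity_type` varchar(36) NOT NULL,
  `score` decimal(4,3) NOT NULL,
  `score_date` date NOT NULL,  -- assumed name; e.g. '2022-12-01' for December 2022
  PRIMARY KEY (`id`),
  KEY `idx_score_date` (`score_date`, `entity_type`)  -- index starts with the date
) ENGINE=InnoDB;

-- Half-open range covering December 2022; the bare column keeps the predicate indexable.
SELECT entity_id, entity_type, score
FROM table_1_by_date
WHERE score_date >= '2022-12-01'
  AND score_date <  '2022-12-01' + INTERVAL 1 MONTH
  AND score > 0;
```

Whether this beats the partitioned layout depends on the data, so it is worth timing both.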
Covering index: this is an index that contains all the columns found anywhere in the SELECT. It may be better (faster) than an index that has only the columns needed for WHERE, or for WHERE + GROUP BY + ORDER BY. It depends on a lot of variables.
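As an illustration for the query in the question, a covering index could look like the sketch below. The index name idx_covering and the column order are assumptions to be tested, not a recommendation from the answer:

```sql
-- Assumed index: the WHERE columns (year, month, entity_type, score)
-- plus the one remaining selected column (entity_id).
ALTER TABLE table_1
  ADD INDEX idx_covering (`year`, `month`, `entity_type`, `score`, `entity_id`);

-- Check whether the optimizer actually picks it.
EXPLAIN
SELECT entity_id, entity_type, score
FROM table_1
WHERE month = 12 AND year = 2022
  AND score > 0
  AND entity_type IN ('type1', 'type2', 'type3', 'type4');
```

When MySQL can answer a query from the index alone, the Extra column of EXPLAIN includes "Using index".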
Order of columns in an index (or PK): The leftmost column(s) have priority. That is the order of the index rows on disk. PK(id, date) is useful if looking up by id (in the WHERE), but not if you are just searching by date.
Sargable: hiding a column in a function disables the use of an index. That is, MONTH(date) cannot use INDEX(date).
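The difference can be seen by comparing two predicates on the same column. This sketch assumes a small hypothetical table sargable_demo with an indexed DATE column score_date; neither name appears in the original schema:

```sql
-- Hypothetical table used only to contrast the two predicates.
CREATE TABLE sargable_demo (
  id int NOT NULL AUTO_INCREMENT,
  score_date date NOT NULL,
  PRIMARY KEY (id),
  KEY idx_score_date (score_date)
) ENGINE=InnoDB;

-- Not sargable: the column is wrapped in functions, so the index
-- cannot be used to narrow the search to a date range.
EXPLAIN
SELECT COUNT(*) FROM sargable_demo
WHERE YEAR(score_date) = 2022 AND MONTH(score_date) = 12;

-- Sargable: the bare column is compared against constants,
-- so a range scan on idx_score_date is possible.
EXPLAIN
SELECT COUNT(*) FROM sargable_demo
WHERE score_date >= '2022-12-01' AND score_date < '2023-01-01';
```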
Blogs -- Index Cookbook and Partition
Test plan
I recommend you time all your queries against a variety of Create Tables.
For the WHERE clause:
- The order of ANDs does not matter.
- When using IN, a single value is equivalent to = and optimizes better. Multiple values may optimize more poorly. As Bill hints at, when the IN list contains all the options, you should eliminate the clause since the Optimizer is not smart enough. So, be sure to test with 1 and/or many items, so as to be realistic to your app.
For the table
- Try Partition BY year + Subpartition by month.
- Try Partition by a column that is the combination of year and month.
- Try without partitioning.
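For the second option in the list above, one possible reading is RANGE COLUMNS partitioning on the (year, month) pair rather than a separate combined column. The sketch below is an assumption about how that could look. The table name, partition names, and boundaries are illustrative:

```sql
-- Hypothetical layout: one partition per (year, month) pair.
CREATE TABLE `table_1_by_ym` (
  `id` int NOT NULL AUTO_INCREMENT,
  `entity_id` varchar(36) NOT NULL,
  `entity_type` varchar(36) NOT NULL,
  `score` decimal(4,3) NOT NULL,
  `month` int NOT NULL DEFAULT '0',
  `year` int NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`, `year`, `month`),  -- partition columns appended to the PK
  KEY `idx_year_month` (`year`, `month`, `entity_type`)
) ENGINE=InnoDB
PARTITION BY RANGE COLUMNS (`year`, `month`) (
  PARTITION p_before_2022_12 VALUES LESS THAN (2022, 12),
  PARTITION p2022_12         VALUES LESS THAN (2023, 1),
  PARTITION p2023_01         VALUES LESS THAN (2023, 2),
  PARTITION pmax             VALUES LESS THAN (MAXVALUE, MAXVALUE)
);

-- The partitions column of EXPLAIN shows which partitions are read.
EXPLAIN
SELECT entity_id, entity_type, score
FROM table_1_by_ym
WHERE `year` = 2022 AND `month` = 12;
```

New months would need new partitions over time, for example by splitting pmax with ALTER TABLE ... REORGANIZE PARTITION.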
For indexes
- Order of the columns (in a composite index) does matter, so try different orderings.
- When partitioning, be sure to tack the partition key(s) onto the end of the PK.
- A partitioned table needs different indexes than a non-partitioned table. That is, what works well for one may work poorly for the other.
Simply use something like this pattern to test various layouts:
CREATE TABLE (( a new layout with or without partitioning and with indexes ))
INSERT INTO test_table SELECT ... FROM real_table;
Change the "..." to adapt to any extra/missing columns in test_table
SELECT ...
Run various 'real' queries
Run each query twice (caching sometimes messes with the timing)
Report the results -- If you provide sufficient info (CREATE TABLE and SELECT), I may have suggestions on further speeding up the test (whether it is partitioned or not).
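One concrete pass through that pattern might look like the following sketch. The layout shown (unpartitioned, with the index reordered to start with year) is only one candidate chosen for illustration, and the index name is an assumption:

```sql
-- 1. A candidate layout to test (here: no partitioning, reordered index).
CREATE TABLE test_table (
  `id` int NOT NULL AUTO_INCREMENT,
  `entity_id` varchar(36) NOT NULL,
  `entity_type` varchar(36) NOT NULL,
  `score` decimal(4,3) NOT NULL,
  `month` int NOT NULL DEFAULT '0',
  `year` int NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `idx_year_month_type` (`year`, `month`, `entity_type`)
) ENGINE=InnoDB;

-- 2. Copy the real data, listing columns explicitly so extra or
--    missing columns in test_table are easy to account for.
INSERT INTO test_table (`id`, `entity_id`, `entity_type`, `score`, `month`, `year`)
SELECT `id`, `entity_id`, `entity_type`, `score`, `month`, `year`
FROM table_1;

-- 3. Run a real query twice and compare the second timing across layouts.
SELECT entity_id, entity_type, score
FROM test_table
WHERE `month` = 12 AND `year` = 2022 AND score > 0;
```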
Answer 2
Score: 2
To answer your question directly, below is example syntax that accomplishes the subpartitioning. Notice the PRIMARY KEY must include all columns used for partitioning or subpartitioning. Read the manual on subpartitioning for more information: https://dev.mysql.com/doc/refman/8.0/en/partitioning-subpartitions.html
Schema (MySQL v8.0)
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`,`month`, `year`),
KEY `idx_month_year` (`month`,`year`, `score`, `entity_type`)
) ENGINE=InnoDB AUTO_INCREMENT=21000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY LIST (`month`)
SUBPARTITION BY HASH(`year`)
SUBPARTITIONS 10 (
PARTITION p0 VALUES IN (0) ENGINE = InnoDB,
PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
PARTITION p5 VALUES IN (5) ENGINE = InnoDB,
PARTITION p6 VALUES IN (6) ENGINE = InnoDB,
PARTITION p7 VALUES IN (7) ENGINE = InnoDB,
PARTITION p8 VALUES IN (8) ENGINE = InnoDB,
PARTITION p9 VALUES IN (9) ENGINE = InnoDB,
PARTITION p10 VALUES IN (10) ENGINE = InnoDB,
PARTITION p11 VALUES IN (11) ENGINE = InnoDB,
PARTITION p12 VALUES IN (12) ENGINE = InnoDB
);
Using EXPLAIN on your query reveals that the query references only one subpartition.
Query #1
EXPLAIN
SELECT
table_1.entity_id AS entity_id,
table_1.entity_type,
table_1.score
FROM table_1
WHERE table_1.month = 12
AND table_1.year = 2022
AND table_1.score > 0
AND table_1.entity_type IN ('type1', 'type2', 'type3', 'type4');
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | table_1 | p12_p12sp2 | range | idx_month_year | idx_month_year | 11 | | 1 | 100 | Using index condition |
The partitions field of the EXPLAIN shows that it accesses only partition p12_p12sp2. The query references year 2022, and 2022 modulo the number of subpartitions (10) is 2, so it reads from subpartition 2.
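The modulo arithmetic and the resulting subpartition names can be checked directly. This sketch assumes the table_1 definition above exists in the current schema:

```sql
-- HASH on an integer column maps a value to (value MOD number_of_subpartitions).
SELECT MOD(2022, 10) AS subpartition_number;

-- List every partition/subpartition pair with its estimated row count.
SELECT PARTITION_NAME, SUBPARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'table_1'
ORDER BY PARTITION_ORDINAL_POSITION, SUBPARTITION_ORDINAL_POSITION;
```

TABLE_ROWS is an estimate for InnoDB, so it is useful for spotting skew between subpartitions rather than for exact counts.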
In addition to the partitioning by month and year, it is also helpful to use an index. In this case, I added score to the index so it would filter out rows where score <= 0. The note in the EXPLAIN "Using index condition" shows that it is delegating further filtering on entity_type to the storage engine. Though in your example, you said there are only four values for entity type, and all four are selected, so that condition won't filter out any rows anyway.
Re your questions in comments below:
> a little bit confused on SUBPARTITIONS 10, why 10
It's just an example. You can choose a different number of subpartitions, whatever you feel is required to reduce the search as much as you want.
To be honest, I've never encountered a situation that required subpartitioning at all, if the search is also optimized with indexes. So I have no guidance on what is an appropriate number of subpartitions.
It's your responsibility to test performance until you are satisfied.
> also a bit confused on the partition name p12_p12sp2, how do I know it selected the partition with year 2022 from looking at that?
The query has a condition year = 2022.
There are 10 subpartitions in my example.
Hash partitioning just uses the integer value to be partitioned, modulo the number of partitions.
2022 modulo 10 is 2. Hence the partition ending in ...sp2 is the one used.
> I also came across this anothermysqldba.blogspot.com/2014/12/… do you know how yours differs from what is shown here (bear in mind that blog is from 2014)
They chose to name the subpartitions. There's no need to do that.
> would there be any performance difference in having a single date, e.g. (2022-12-21), instead of separate columns month and year?
That depends on the query, and I'll leave it to you to test. Any predictions I make won't be accurate with your data on your server.
> I can also see that you partition by month and subpartition by year, as opposed to partition by year and subpartition by month. Can you explain the reasoning?
Subpartitioning works only if the outer partitions are LIST or RANGE partitions, and the subpartitions are HASH or KEY partitions. This is in the manual page I linked to.
There are a finite number of months (12). This makes it easy to partition by LIST as you did. You won't ever need more partitions. If you had partitioned by YEAR as the outer partition, you would have needed to specify year values in the list, and this is a growing set, so you would periodically have to alter the table to extend the list or range to account for new years.
Whereas when partitioning by HASH for the subpartitioning, the new year values are mapped into the finite set of subpartitions, so it's okay that it's not a finite list. You won't have to alter table to repartition (unless you want to change the number of subpartitions).
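For comparison, the layout asked about in the question (year as the outer partition, month as the subpartition) could be sketched as below. The table name, partition names, and year boundaries are illustrative assumptions. The trailing ALTER TABLE shows the periodic maintenance described above:

```sql
-- Hypothetical layout: RANGE on year outside, HASH on month inside.
CREATE TABLE `table_1_by_year` (
  `id` int NOT NULL AUTO_INCREMENT,
  `entity_id` varchar(36) NOT NULL,
  `entity_type` varchar(36) NOT NULL,
  `score` decimal(4,3) NOT NULL,
  `month` int NOT NULL DEFAULT '0',
  `year` int NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`, `year`, `month`),
  KEY `idx_year_month` (`year`, `month`, `score`, `entity_type`)
) ENGINE=InnoDB
PARTITION BY RANGE (`year`)
SUBPARTITION BY HASH (`month`)
SUBPARTITIONS 12 (
  PARTITION p2021 VALUES LESS THAN (2022),
  PARTITION p2022 VALUES LESS THAN (2023),
  PARTITION p2023 VALUES LESS THAN (2024)
);

-- Needed before rows for a new year arrive; otherwise inserts
-- with year = 2024 fail because no partition accepts the value.
ALTER TABLE table_1_by_year
  ADD PARTITION (PARTITION p2024 VALUES LESS THAN (2025));
```

With 12 hash subpartitions, month values 0 and 12 both land in subpartition 0, since each value is taken modulo 12.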