Why is Hibernate splitting my batch insert into 3 queries
Question
I'm currently trying to implement a batch insert using Hibernate. Here are the few things I implemented:
1. Entity
@Entity
@Table(name = "my_bean_table")
@Data
public class MyBean {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "seqGen")
    @SequenceGenerator(name = "seqGen", sequenceName = "my_bean_seq", allocationSize = 50)
    @Column(name = "my_bean_id")
    private Long id;

    @Column(name = "my_bean_name")
    private String name;

    @Column(name = "my_bean_age")
    private int age;

    public MyBean(String name, int age) {
        this.name = name;
        this.age = age;
    }
}
2. application.properties
Hibernate and the datasource are configured this way:
spring.datasource.url=jdbc:postgresql://{ip}:{port}/${db}?reWriteBatchedInserts=true&loggerLevel=TRACE&loggerFile=pgjdbc.log
spring.jpa.show-sql=true
spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
NB: &loggerLevel=TRACE&loggerFile=pgjdbc.log is for debugging purposes.
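For context on what reWriteBatchedInserts actually operates on, here is a minimal plain-JDBC sketch (connection details and values are placeholders, not from this post) of the kind of batch Hibernate hands to the driver: 50 executions of one identical single-row INSERT. The rewrite turns such a batch into the multi-value INSERTs seen later in the driver log.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // placeholder connection details, for illustration only
        String url = "jdbc:postgresql://localhost:5432/db?reWriteBatchedInserts=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "insert into my_bean_table (my_bean_age, my_bean_name, my_bean_id) values (?, ?, ?)")) {
            for (int i = 1; i <= 50; i++) {
                ps.setInt(1, 20 + i);
                ps.setString(2, "bean-" + i);
                ps.setLong(3, i);
                ps.addBatch(); // collect 50 identical single-row inserts
            }
            ps.executeBatch(); // the driver decides how to group these 50 rows
        }
    }
}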
3. Elements in my PostgreSQL database
CREATE SEQUENCE my_bean_seq
    INCREMENT 50
    START 1
    MINVALUE 1
    MAXVALUE 9223372036854775807
    CACHE 1;

CREATE TABLE my_bean_table
(
    my_bean_id   bigint NOT NULL DEFAULT nextval('my_bean_seq'::regclass),
    my_bean_name varchar(100) NOT NULL,
    my_bean_age  smallint NOT NULL,
    CONSTRAINT my_bean_table_pkey PRIMARY KEY (my_bean_id)
);
EDIT: Added ItemWriter

public class MyBeanWriter implements ItemWriter<MyBean> {

    private final Logger logger = LoggerFactory.getLogger(MyBeanWriter.class);

    @Autowired
    private MyBeanRepository repository;

    @Override
    public void write(List<? extends MyBean> items) throws Exception {
        repository.saveAll(items);
    }
}
The Spring Batch commit-interval is set to 50 as well; see the step sketch below.
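A minimal sketch (assuming Spring Batch 4's StepBuilderFactory; the step and reader names are my assumptions, not from the original post) of the chunk-oriented step this writer could belong to. chunk(50) is the Java-config equivalent of commit-interval="50", so each transaction hands Hibernate exactly one batch of 50 items.

@Bean
public Step insertStep(StepBuilderFactory stepBuilderFactory,
                       ItemReader<MyBean> reader,
                       MyBeanWriter writer) {
    return stepBuilderFactory.get("insertStep")
            .<MyBean, MyBean>chunk(50) // commit every 50 items
            .reader(reader)
            .writer(writer)
            .build();
}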
In the log file produced by the JDBC driver I get the following lines:
avr. 10, 2020 7:26:48 PM org.postgresql.core.v3.QueryExecutorImpl execute
FINEST: batch execute 3 queries, handler=org.postgresql.jdbc.BatchResultHandler@1317ac2c, maxRows=0, fetchSize=0, flags=5
avr. 10, 2020 7:26:48 PM org.postgresql.core.v3.QueryExecutorImpl sendParse
FINEST: FE=> Parse(stmt=null,query="insert into my_bean_table (my_bean_age, my_bean_name, my_bean_id) values ($1, $2, $3),($4, $5, $6),($7, $8, $9),($10, $11, $12),($13, $14, $15),($16, $17, $18),($19, $20, $21),($22, $23, $24),($25, $26, $27),($28, $29, $30),($31, $32, $33),($34, $35, $36),($37, $38, $39),($40, $41, $42),($43, $44, $45),($46, $47, $48),($49, $50, $51),($52, $53, $54),($55, $56, $57),($58, $59, $60),($61, $62, $63),($64, $65, $66),($67, $68, $69),($70, $71, $72),($73, $74, $75),($76, $77, $78),($79, $80, $81),($82, $83, $84),($85, $86, $87),($88, $89, $90),($91, $92, $93),($94, $95, $96)",oids={23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20})
...
FINEST: FE=> Execute(portal=null,limit=1)
avr. 10, 2020 7:26:48 PM org.postgresql.core.v3.QueryExecutorImpl sendParse
FINEST: FE=> Parse(stmt=null,query="insert into my_bean_table (my_bean_age, my_bean_name, my_bean_id) values ($1, $2, $3),($4, $5, $6),($7, $8, $9),($10, $11, $12),($13, $14, $15),($16, $17, $18),($19, $20, $21),($22, $23, $24),($25, $26, $27),($28, $29, $30),($31, $32, $33),($34, $35, $36),($37, $38, $39),($40, $41, $42),($43, $44, $45),($46, $47, $48)",oids={23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20,23,1043,20})
...
avr. 10, 2020 7:26:48 PM org.postgresql.core.v3.QueryExecutorImpl sendParse
FINEST: FE=> Parse(stmt=null,query="insert into my_bean_table (my_bean_age, my_bean_name, my_bean_id) values ($1, $2, $3),($4, $5, $6)",oids={23,1043,20,23,1043,20})
Here is my question: why is the batch split into 3 queries:
- first query: 32 elements
- second query: 16 elements
- third query: 2 elements
NB: I tried setting the batch size to 100 and 200, and I still got 3 different queries.
Answer 1 (score: 2)
Found this while debugging the PgPreparedStatement class and its transformQueriesAndParameters() method:
@Override
protected void transformQueriesAndParameters() throws SQLException {
    ...
    BatchedQuery originalQuery = (BatchedQuery) preparedQuery.query;
    // Single query cannot have more than {@link Short#MAX_VALUE} binds, thus
    // the number of multi-values blocks should be capped.
    // Typically, it does not make much sense to batch more than 128 rows: performance
    // does not improve much after updating 128 statements with 1 multi-valued one, thus
    // we cap maximum batch size and split there.
    ...
    final int highestBlockCount = 128;
    final int maxValueBlocks = bindCount == 0 ? 1024 /* if no binds, use 1024 rows */
        : Integer.highestOneBit( // deriveForMultiBatch supports powers of two only
            Math.min(Math.max(1, (Short.MAX_VALUE - 1) / bindCount), highestBlockCount));
}
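Plugging this entity's three binds per row into that formula confirms the 128-row cap; here is a standalone check of the same arithmetic (my own snippet, not pgjdbc code):

public class MaxValueBlocksCheck {
    public static void main(String[] args) {
        int bindCount = 3; // my_bean_age, my_bean_name, my_bean_id
        int highestBlockCount = 128;
        int maxValueBlocks = Integer.highestOneBit(
                Math.min(Math.max(1, (Short.MAX_VALUE - 1) / bindCount), highestBlockCount));
        // (32767 - 1) / 3 = 10922, capped to 128; highestOneBit(128) = 128
        System.out.println(maxValueBlocks); // prints 128
    }
}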
Two things follow from this:
- a single rewritten query for a batch insert can contain at most 128 rows
- the row counts of the statements the driver actually sends are powers of two, as illustrated below
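Given that cap and the powers-of-two restriction, a 50-row batch decomposes greedily into 32 + 16 + 2 rows, which is exactly the three statements in the driver log. A standalone sketch of that decomposition (my illustration of the observed behaviour, not pgjdbc's actual code path):

public class PowerOfTwoSplit {
    public static void main(String[] args) {
        int remaining = 50;       // rows in the JDBC batch
        int maxValueBlocks = 128; // cap derived above
        while (remaining > 0) {
            // largest power of two that still fits the remaining rows
            int chunk = Integer.highestOneBit(Math.min(remaining, maxValueBlocks));
            System.out.println("multi-value insert with " + chunk + " rows");
            remaining -= chunk;   // prints 32, then 16, then 2
        }
    }
}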
I'm now using 128 both as the sequence increment in the database and as the batch-size parameter on the client side, and it works like a charm.
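Concretely, that means the entity's allocationSize, the database sequence's INCREMENT, hibernate.jdbc.batch_size and the Spring Batch commit-interval should all carry the same value, 128. On the entity side that would look like this (a sketch derived from the mapping in the question):

@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "seqGen")
@SequenceGenerator(name = "seqGen", sequenceName = "my_bean_seq", allocationSize = 128)
@Column(name = "my_bean_id")
private Long id;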
Answer 2 (score: 0)
I don't have a conclusive answer, but this behaviour seems very similar to that of batch fetching, and it probably happens for the same reason.
It uses different statements with the number of parameter sets equal to powers of two.
This is to minimize the number of different statements executed.
Databases need to parse statements and use caches to hold parsed statements.
If a client executes tons of statements that are doing essentially the same thing but differ in the number of parameter sets, this would render the cache useless.
On the other hand I haven't seen it with batch inserts but only with bulk fetch operations.
I have a couple of guesses why this might be happening:
- Your ids get generated by the database, so before the data can be written to the database the ids need to be queried from the database sequence. Maybe that select behaviour then leaks through to the inserts.
- It could be an optimisation done by the JDBC driver that is rewriting this kind of statement.
- Hibernate does that all the time and I just missed it, although I think it would be weird to do this when the number of parameter sets is equal to the batch size.