How to increase Pentaho transformation input/output speed


Question

I am using Pentaho Data Integration (PDI) - Spoon to create ETLs, and I am very focused on performance.
I developed an ETL that copies 2,500,000 rows (each row has 104 columns) from MySQL 8 to a ClickHouse database, and it takes 30 minutes.
The destination table has no indexes or constraints, and ClickHouse is a columnar database.

I am using Linux Ubuntu 22.04, and the transformation runs on the Pentaho server through spoon.sh.

How can I increase the transformation's input/output speed?

I am using only 4 steps:

  1. Truncate the table with an Execute SQL Script step
  2. Fetch the data with a Table Input step
  3. Change the date formats with a Select Values step
  4. Insert the data into the destination table with a Table Output step

I want to increase the I/O speed of the PDI-Spoon transformation; a rough sketch of the SQL behind steps 1 and 2 is shown below.
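
For illustration only (not part of the original question; dest_table, src_table, and the columns are placeholders), the SQL behind the Execute SQL Script and Table Input steps would look roughly like this:

    -- Step 1: Execute SQL Script, run against ClickHouse before the load
    TRUNCATE TABLE dest_table;

    -- Step 2: Table Input, run against MySQL 8 to stream the source rows
    SELECT *
    FROM src_table;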


Answer 1

Score: 1

If it's a point-in-time migration, you can try using the MySQL table engine in ClickHouse in conjunction with INSERT INTO ... SELECT syntax to migrate the data directly from within your ClickHouse instance; it should be faster.

Links:
https://clickhouse.cloud/integrations/mysql
https://clickhouse.com/docs/en/engines/table-engines/integrations/mysql
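
A minimal sketch of this approach, assuming the ClickHouse destination table (dest_table) already exists; the MySQL host, database, table, columns, and credentials are placeholders, not details from the original answer:

    -- Expose the MySQL source table inside ClickHouse via the MySQL table engine.
    -- The column list must mirror the MySQL table; only two placeholder columns are shown.
    CREATE TABLE mysql_src
    (
        id         UInt64,
        created_at DateTime
    )
    ENGINE = MySQL('mysql-host:3306', 'source_db', 'source_table', 'mysql_user', 'mysql_password');

    -- Copy the rows into the ClickHouse table in one server-side statement.
    INSERT INTO dest_table
    SELECT * FROM mysql_src;

Because the insert runs entirely inside ClickHouse, it avoids the per-row round trips of an external ETL tool, which is usually much faster for a one-off copy.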


Answer 2

Score: 0


There are a couple of options in PDI to improve the I/O:

  1. Use a Carte cluster.

You can make use of multiple Carte servers and execute the job in parallel across the cluster. The way to create a Carte cluster and execute jobs on it is described in this blog.

  2. Use a bulk loader step.

You can use the Bulk Load from MySQL to File step in PDI to bulk load the data into a file; this will improve your I/O. You can later use native ClickHouse database APIs to load the data from the file into the table (a sketch using clickhouse-client follows below).

Alternatively, you can also use a Text File Output step to write the data to a file and then upload that file into the ClickHouse database.
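
A minimal sketch of the file-load side, assuming the extract was written to a local CSV file named data.csv with a header row and that the ClickHouse destination table (dest_table) already exists; the file and table names are placeholders:

    -- Run through clickhouse-client; FROM INFILE reads the file on the client side
    -- and sends the rows to the server in large batches.
    INSERT INTO dest_table
    FROM INFILE 'data.csv'
    FORMAT CSVWithNames;

Loading from a pre-extracted file lets ClickHouse ingest the data in a few large batches instead of the many smaller inserts an ETL tool typically issues.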
