Writing a Spark DataFrame to Azure SQL Server with a clustered columnstore index and PK/FK

Question

Consider the following use case: I want to write a Spark DataFrame to a DWH stored in an Azure SQL Server database using Microsoft's Apache Spark SQL Connector, with either "overwrite" or "append" depending on the use case.

For DWHs, clustered columnstore indexes are recommended. As I also want primary key constraints to be enforced, nonclustered indexes on a clustered columnstore index would probably yield the best performance.

How do I create or append to tables with PySpark using this index design?

    # mode can be "overwrite" or "append", depending on the use case
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", sqldbuser) \
        .option("password", sqldbpwd) \
        .save()

The answer does not have to be purely PySpark-based (e.g. it can use SQLAlchemy); it should just work robustly.

Answer 1

Score: 1

Your best approach would be to insert the DataFrame from PySpark into a staging table on Azure SQL Server (with overwrite), then use a stored procedure to insert the data into your columnstore-indexed DWH table.

In the stored procedure, use something like this:

    insert into dwh_table
    select *
    from stg_table
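
A minimal sketch of such a procedure, assuming the placeholder name usp_load_dwh_table and that the staging and target tables (stg_table, dwh_table) share the same column layout:

    create or alter procedure dbo.usp_load_dwh_table
    as
    begin
        set nocount on;

        -- copy the freshly staged rows into the columnstore DWH table
        insert into dbo.dwh_table
        select *
        from dbo.stg_table;
    end

The procedure can then be executed from the load pipeline (e.g. via pyodbc or SQLAlchemy) once the Spark write to the staging table has finished.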

Also, try inserting rows in large batches so that the columnstore index can compress them efficiently (over 1,048,576 rows).

About the nonclustered index: if you need it as a primary key to prevent duplicates, then you have no choice but to keep it active during the load.
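
For reference, a minimal sketch of that table design (table and column names are placeholders), with the table stored as a clustered columnstore index and the primary key enforced through a nonclustered index:

    create table dbo.dwh_table
    (
        id        bigint       not null,
        attribute nvarchar(50) null
    );

    -- store the table as a clustered columnstore index
    create clustered columnstore index cci_dwh_table on dbo.dwh_table;

    -- enforce the primary key through a nonclustered index on top of the columnstore
    alter table dbo.dwh_table
        add constraint pk_dwh_table primary key nonclustered (id);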

If not, try disabling the nonclustered indexes before the load (in your stored procedure) and enabling them again after the load. Note that this does not always give the best performance; you'll have to test it with your workload.
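
A minimal sketch of that pattern inside the load procedure, assuming a plain (non-PK) nonclustered index with the placeholder name ix_dwh_table_attribute:

    -- disable the nonclustered index so it is not maintained during the bulk insert
    alter index ix_dwh_table_attribute on dbo.dwh_table disable;

    insert into dbo.dwh_table
    select *
    from dbo.stg_table;

    -- rebuilding the index re-enables it after the load
    alter index ix_dwh_table_attribute on dbo.dwh_table rebuild;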

Please refer to this document: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-data-loading-guidance?view=sql-server-ver16
