Writing a Spark DataFrame to Azure SQL Server with a clustered columnstore index and PK/FK

Question

Consider the following use case: I want to write a Spark DataFrame to a DWH stored in an Azure SQL Server database using Microsoft's Apache Spark SQL Connector, with either "overwrite" or "append" depending on the use case.

For DWHs, clustered columnstore indexes are recommended. As I also want primary key constraints to be enforced, nonclustered indexes on a clustered columnstore index would probably yield the best performance.

How do I create or append to tables with PySpark using this index design?

    # mode can be "overwrite" or "append", depending on the use case
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", sqldbuser) \
        .option("password", sqldbpwd) \
        .save()

The answer does not have to be purely PySpark-based (e.g. it can use SQLAlchemy); it should just work robustly.

Answer 1

Score: 1

Your best approach would be to insert the DataFrame from PySpark into a staging table on Azure SQL Server (with overwrite), then use a stored procedure to insert the data into your columnstore-indexed DWH table.

In the stored procedure, use something like this:

    insert into dwh_table
    select *
    from stg_table
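
A minimal sketch of such a procedure, assuming the placeholder name usp_load_dwh_table and that the staging and target tables (stg_table, dwh_table) share the same column layout:

    create or alter procedure dbo.usp_load_dwh_table
    as
    begin
        set nocount on;

        -- copy the freshly staged rows into the columnstore DWH table
        insert into dbo.dwh_table
        select *
        from dbo.stg_table;
    end

The procedure can then be executed from the load pipeline (e.g. via pyodbc or SQLAlchemy) once the Spark write to the staging table has finished.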

Also, try inserting rows in large batches so that the columnstore index can compress them efficiently (over 1,048,576 rows).

About the nonclustered index: if you need it as a primary key to prevent duplicates, then you have no choice but to keep it active during the load.
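
For reference, a minimal sketch of that table design (table and column names are placeholders), with the table stored as a clustered columnstore index and the primary key enforced through a nonclustered index:

    create table dbo.dwh_table
    (
        id        bigint       not null,
        attribute nvarchar(50) null
    );

    -- store the table as a clustered columnstore index
    create clustered columnstore index cci_dwh_table on dbo.dwh_table;

    -- enforce the primary key through a nonclustered index on top of the columnstore
    alter table dbo.dwh_table
        add constraint pk_dwh_table primary key nonclustered (id);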

If not, try disabling the nonclustered indexes before the load (in your stored procedure) and enabling them again after the load. Note that this does not always give the best performance; you'll have to test it with your workload.
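
A minimal sketch of that pattern inside the load procedure, assuming a plain (non-PK) nonclustered index with the placeholder name ix_dwh_table_attribute:

    -- disable the nonclustered index so it is not maintained during the bulk insert
    alter index ix_dwh_table_attribute on dbo.dwh_table disable;

    insert into dbo.dwh_table
    select *
    from dbo.stg_table;

    -- rebuilding the index re-enables it after the load
    alter index ix_dwh_table_attribute on dbo.dwh_table rebuild;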

Please refer to this document: https://learn.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-data-loading-guidance?view=sql-server-ver16
