2023-02-14 21:02:47

Databricks DLT pipeline with for..loop reports error "AnalysisException: Cannot redefine dataset"

Question

I have the following code, which works fine for a single table. But when I use a for loop to process all the tables in my database, I get the error "AnalysisException: Cannot redefine dataset 'source_ds',Map(),Map(),List(),List(),Map())".

I need to pass the table name to source_ds so that CDC is processed based on the key and sequence columns. Any help or suggestions would be appreciated.

import json
import time

import dlt
from pyspark.sql.functions import col, expr

raw_db_name = "raw_db"
landing_db_name = "landing_db"

def generate_silver_tables(target_table, source_table, keys_col_list):

    # Source dataset that reads the raw table
    @dlt.table
    def source_ds():
        return spark.table(f"{raw_db_name}.{source_table}")

    # Create the target table definition
    dlt.create_target_table(
        name=target_table,
        comment=f"Clean, merged {target_table}",
        # partition_cols=["topic"],
        table_properties={
            "quality": "silver",
            "pipelines.autoOptimize.managed": "true",
        },
    )

    # Do the merge
    dlt.apply_changes(
        target=target_table,
        source="source_ds",
        keys=keys_col_list,
        apply_as_deletes=expr("operation = 'DELETE'"),
        sequence_by=col("ts_ms"),
        ignore_null_updates=False,
        except_column_list=["operation", "timestamp_ms"],
        stored_as_scd_type="1",
    )

# THIS WORKS FINE
#---------------
# raw_dbname = "raw_db"
# raw_tbl_name = 'raw_table'
# processed_tbl_name = raw_tbl_name.replace("raw", "processed")
# generate_silver_tables(processed_tbl_name, raw_tbl_name)


# Loop over every landing table and define one silver table per source table
table_list = spark.sql(f"show tables in {landing_db_name}").collect()
for row in table_list:
    landing_tbl_name = row.tableName
    # The key column holds a JSON object; its field names are the merge keys
    s2 = spark.sql(f"select key from {landing_db_name}.{landing_tbl_name} limit 1")
    keys_col_list = list(json.loads(s2.collect()[0][0]).keys())
    raw_tbl_name = landing_tbl_name.replace("landing", "raw")
    processed_tbl_name = landing_tbl_name.replace("landing", "processed")
    generate_silver_tables(processed_tbl_name, raw_tbl_name, keys_col_list)
#     time.sleep(10)

Answer 1

Score: 1

You need to give each source table a unique name by passing the name argument to the dlt.table decorator, and then use that same name as the source in apply_changes. Otherwise the dataset name is taken from the function name, and the second call fails because a dataset with that name has already been defined. Like this:

def generate_silver_tables(target_table, source_table, keys_col_list):

    # Register the source dataset under a per-table name instead of "source_ds"
    @dlt.table(
        name=source_table
    )
    def source_ds():
        return spark.table(f"{raw_db_name}.{source_table}")

    # Create the target table definition
    dlt.create_target_table(
        name=target_table,
        comment=f"Clean, merged {target_table}",
        # partition_cols=["topic"],
        table_properties={
            "quality": "silver",
            "pipelines.autoOptimize.managed": "true",
        },
    )

    # Do the merge, reading from the uniquely named source dataset
    dlt.apply_changes(
        target=target_table,
        source=source_table,
        keys=keys_col_list,
        apply_as_deletes=expr("operation = 'DELETE'"),
        sequence_by=col("ts_ms"),
        ignore_null_updates=False,
        except_column_list=["operation", "timestamp_ms"],
        stored_as_scd_type="1",
    )

See DLT Cookbook for full example.
