PySpark / Snowpark 左反连接过程中的随机列名问题

huangapple go评论59阅读模式
英文:

PySpark / Snowpark random column name during left anti join problem

问题

I am trying to compare two dataframes, to get new records to be inserted into an incremental table.

我正在尝试比较两个数据框,以获取要插入到增量表中的新记录。

I am following previously asked questions, example here.

我正在遵循先前提出的问题,示例在此处

But I'm having this problem, where even though I used the alias function, it seems that the output dataframe is having random column names. At least that is what I see in the Snowflake query history.

但我遇到了这个问题,即使我使用了alias函数,输出的数据框似乎具有随机列名。至少这是我在Snowflake查询历史中看到的情况。

Sample code:

示例代码:

main = session.table("source")
incremental = session.table("target")

new_customer = (incremental
        .join(main, incremental.customer_id==main.customer_id, "leftanti" )
        .select(main.customer_id.alias("customer_id"))
                        ).show()

Error:
错误:

snowflake.snowpark.exceptions.SnowparkSQLException: 1304): 01acd0d6-3201-cee0-0000-6b1502059ff6: 000904 (42000): SQL compilation error: error line 1 at position 7
invalid identifier "r_tu5s_customer_id".

Query history:
查询历史:

SELECT "r_tu5s_customer_id" AS "customer_id" FROM ( SELECT  *  FROM ( SELECT "customer_id" .........

(Note: I've translated the provided text as requested. If you need further assistance or have more specific questions, feel free to ask.)

英文:

I am trying to compare two dataframes, to get new records to be inserted into an incremental table.

I am following previously asked questions, example https://stackoverflow.com/questions/72181011/how-to-compare-two-dataframes-and-extract-unmatched-rows-in-pyspark

But I'm having this problem, where even though I used the alias function, it seems that the output dataframe is having random column names. At least that is what I see in the Snowflake query history.

Sample code:

main = session.table("source")
incremental = session.table("target")

new_customer = (incremental
        .join(main, incremental.customer_id==main.customer_id, "leftanti" )
        .select(main.customer_id.alias("customer_id"))
                        ).show()

Error:

snowflake.snowpark.exceptions.SnowparkSQLException: 1304): 01acd0d6-3201-cee0-0000-6b1502059ff6: 000904 (42000): SQL compilation error: error line 1 at position 7
invalid identifier '"r_tu5s_customer_id".

Query history:

SELECT "r_tu5s_customer_id" AS "customer_id" FROM ( SELECT  *  FROM ( SELECT "customer_id" .........

答案1

得分: 1

将注释转化为答案以供他人使用。

leftanti类似于连接功能,但仅返回__左侧__DataFrame中非匹配的记录的列。

因此,解决方案只是交换两个数据帧,以便您可以获取main数据帧中不存在于incremental数据帧中的新记录。

main.join(incremental, incremental.customer_id==main.customer_id, "leftanti")
英文:

Turning the comment into an answer to be useful for others.

The leftanti is similar to the join functionality, but it returns only columns from the left DataFrame for non-matched records.

So the solution is just swtiching the two dataframes so you can get the new records in main df that don't exist in incremental df

main.join(incremental, incremental.customer_id==main.customer_id, "leftanti" )

huangapple
  • 本文由 发表于 2023年6月8日 05:13:33
  • 转载请务必保留本文链接:https://go.coder-hub.com/76427150.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定