英文:
PySpark / Snowpark random column name during left anti join problem
问题
I am trying to compare two dataframes, to get new records to be inserted into an incremental table.
我正在尝试比较两个数据框,以获取要插入到增量表中的新记录。
I am following previously asked questions, example here.
我正在遵循先前提出的问题,示例在此处。
But I'm having this problem, where even though I used the alias
function, it seems that the output dataframe is having random column names. At least that is what I see in the Snowflake query history.
但我遇到了这个问题,即使我使用了alias
函数,输出的数据框似乎具有随机列名。至少这是我在Snowflake查询历史中看到的情况。
Sample code:
示例代码:
main = session.table("source")
incremental = session.table("target")
new_customer = (incremental
.join(main, incremental.customer_id==main.customer_id, "leftanti" )
.select(main.customer_id.alias("customer_id"))
).show()
Error:
错误:
snowflake.snowpark.exceptions.SnowparkSQLException: 1304): 01acd0d6-3201-cee0-0000-6b1502059ff6: 000904 (42000): SQL compilation error: error line 1 at position 7
invalid identifier "r_tu5s_customer_id".
Query history:
查询历史:
SELECT "r_tu5s_customer_id" AS "customer_id" FROM ( SELECT * FROM ( SELECT "customer_id" .........
(Note: I've translated the provided text as requested. If you need further assistance or have more specific questions, feel free to ask.)
英文:
I am trying to compare two dataframes, to get new records to be inserted into an incremental table.
I am following previously asked questions, example https://stackoverflow.com/questions/72181011/how-to-compare-two-dataframes-and-extract-unmatched-rows-in-pyspark
But I'm having this problem, where even though I used the alias
function, it seems that the output dataframe is having random column names. At least that is what I see in the Snowflake query history.
Sample code:
main = session.table("source")
incremental = session.table("target")
new_customer = (incremental
.join(main, incremental.customer_id==main.customer_id, "leftanti" )
.select(main.customer_id.alias("customer_id"))
).show()
Error:
snowflake.snowpark.exceptions.SnowparkSQLException: 1304): 01acd0d6-3201-cee0-0000-6b1502059ff6: 000904 (42000): SQL compilation error: error line 1 at position 7
invalid identifier '"r_tu5s_customer_id".
Query history:
SELECT "r_tu5s_customer_id" AS "customer_id" FROM ( SELECT * FROM ( SELECT "customer_id" .........
答案1
得分: 1
将注释转化为答案以供他人使用。
leftanti
类似于连接功能,但仅返回__左侧__DataFrame中非匹配的记录的列。
因此,解决方案只是交换两个数据帧,以便您可以获取main
数据帧中不存在于incremental
数据帧中的新记录。
main.join(incremental, incremental.customer_id==main.customer_id, "leftanti")
英文:
Turning the comment into an answer to be useful for others.
The leftanti
is similar to the join functionality, but it returns only columns from the left DataFrame for non-matched records.
So the solution is just swtiching the two dataframes so you can get the new records in main
df that don't exist in incremental
df
main.join(incremental, incremental.customer_id==main.customer_id, "leftanti" )
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论