June 8, 2023, 05:13:33

PySpark / Snowpark random column name during left anti join problem

Question

I am trying to compare two dataframes, to get new records to be inserted into an incremental table.

I am following previously asked questions, for example https://stackoverflow.com/questions/72181011/how-to-compare-two-dataframes-and-extract-unmatched-rows-in-pyspark

However, even though I used the alias function, the output dataframe seems to end up with a random column name. At least, that is what I see in the Snowflake query history.

Sample code:

main = session.table("source")
incremental = session.table("target")

new_customer = (incremental
        .join(main, incremental.customer_id==main.customer_id, "leftanti" )
        .select(main.customer_id.alias("customer_id"))
                        ).show()

Error:

snowflake.snowpark.exceptions.SnowparkSQLException: (1304): 01acd0d6-3201-cee0-0000-6b1502059ff6: 000904 (42000): SQL compilation error: error line 1 at position 7
invalid identifier "r_tu5s_customer_id".

Query history:

SELECT "r_tu5s_customer_id" AS "customer_id" FROM ( SELECT  *  FROM ( SELECT "customer_id" .........

Answer 1

Score: 1

Turning the comment into an answer so that it is useful for others.

A leftanti join works like a regular join, but it returns only the columns of the left DataFrame, and only for the rows that have no match in the right DataFrame. In the question, main is the right side of the join, so main.customer_id is not part of the output, which is consistent with the invalid identifier error.

So the solution is just switching the two dataframes, so that you get the new records in main that don't exist in incremental:

main.join(incremental, incremental.customer_id == main.customer_id, "leftanti")
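To illustrate why the order of the two dataframes matters, here is a minimal plain-Python sketch of left anti join semantics. It does not use the Snowpark or PySpark API: the `left_anti_join` helper, the sample rows, and the `name` and `loaded_at` columns are all hypothetical and exist only for illustration. It also ignores SQL NULL handling.

```python
def left_anti_join(left_rows, right_rows, key):
    """Return the rows of left_rows whose key has no match in right_rows.

    Only the left rows (and therefore only left-side columns) are returned,
    which mirrors what a left anti join produces.
    """
    right_keys = {row[key] for row in right_rows}
    return [dict(row) for row in left_rows if row[key] not in right_keys]


# Hypothetical stand-ins for the "source" (main) and "target" (incremental) tables.
source = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Cleo"},
]
target = [
    {"customer_id": 1, "loaded_at": "2023-06-01"},
    {"customer_id": 2, "loaded_at": "2023-06-01"},
]

# source on the left: customers that are not yet in target.
new_customers = left_anti_join(source, target, "customer_id")
print(new_customers)

# target on the left (the order used in the question): nothing is new here,
# and no source column would be available in the result either way.
reversed_order = left_anti_join(target, source, "customer_id")
print(reversed_order)
```

With source on the left, the result contains the one customer missing from target, and it carries only source columns. With the order reversed, the result is empty, and no source column would be present in any row it did return.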
