2023-02-24 15:09:06

Select mismatched columns and values from two PySpark DataFrames with identical schemas

Question

I want to select the mismatched columns and their values from two DataFrames that have the same schema but come from different sources.

What I have now:

col_1	key_col	col_2	col_3	col_4
a	key1	b	c	d
w	key2	x	y	z

col_1	key_col	col_2	col_3	col_4
a	key1	b	p	q
w	key2	x	y	z

I have two DataFrames with the same schema from different data sources.

What I want:

Join (inner join) the 2 dataframes using the "key_col" as the join key and give the output in the following format:

For each row of the table obtained after the join, return the following row:

key_col	mismatched_column_names	mismatched_values_in_first_df	mismatched_values_in_second_df
key1	[col_3, col_4]	[c, d]	[p, q]

I am looking for a way to do this in PySpark.

Answer 1

Score: 1

This would work:

from pyspark.sql import functions as F

cols = [c for c in df1.schema.names if c != "key_col"]  # every column except the join key
empty_map = F.create_map().cast("map<string,string>")  # assumes the compared columns are strings
(
    df1.alias("df1").join(df2.alias("df2"), F.col("df1.key_col") == F.col("df2.key_col"))
    # Per column: a one-entry map {df1 value: df2 value} on mismatch, otherwise null.
    # Note: != yields null when either side is null, so null-vs-value differences are not reported.
    .select("df1.key_col", *[F.when(F.col("df1." + c) != F.col("df2." + c), F.create_map("df1." + c, "df2." + c)).alias(c) for c in cols])
    # Merge the per-column maps, substituting an empty map for columns that match.
    .withColumn("merged", F.map_concat(*[F.coalesce(F.col(c), empty_map) for c in cols]))
    # Collect the names of the mismatched columns, then drop the nulls.
    .withColumn("mismatched_column_names", F.array(*[F.when(F.col(c).isNotNull(), F.lit(c)) for c in cols]))
    .withColumn("mismatched_column_names", F.expr("filter(mismatched_column_names, x -> x is not null)"))
    .withColumn("mismatched_values_in_first_df", F.map_keys("merged"))
    .withColumn("mismatched_values_in_second_df", F.map_values("merged"))
    .filter(F.size("mismatched_values_in_second_df") != 0)  # keep only rows with a mismatch
    .select("key_col", "mismatched_column_names", "mismatched_values_in_first_df", "mismatched_values_in_second_df")
    .show(truncate=False)
)
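
For small inputs, the expected result can be cross-checked without Spark. The following is a minimal plain-Python sketch of the same comparison and is not part of the original answer. The helper name `find_mismatches` and the list-of-dicts row representation are assumptions made for illustration, and the sketch assumes `key_col` values are unique within each source.

```python
def find_mismatches(rows1, rows2, key="key_col"):
    """Inner-join two lists of dict rows on `key` and report the differing columns.

    Hypothetical helper for illustration; assumes keys are unique per source
    and that both sources share the same column names.
    """
    rows2_by_key = {row[key]: row for row in rows2}
    report = []
    for left in rows1:
        right = rows2_by_key.get(left[key])
        if right is None:
            continue  # inner join: skip keys missing from the second source
        # Column order follows the first source's row (dicts keep insertion order).
        cols = [c for c in left if c != key and left[c] != right[c]]
        if cols:
            report.append(
                (left[key], cols, [left[c] for c in cols], [right[c] for c in cols])
            )
    return report


# Sample data from the question.
df1_rows = [
    {"col_1": "a", "key_col": "key1", "col_2": "b", "col_3": "c", "col_4": "d"},
    {"col_1": "w", "key_col": "key2", "col_2": "x", "col_3": "y", "col_4": "z"},
]
df2_rows = [
    {"col_1": "a", "key_col": "key1", "col_2": "b", "col_3": "p", "col_4": "q"},
    {"col_1": "w", "key_col": "key2", "col_2": "x", "col_3": "y", "col_4": "z"},
]

print(find_mismatches(df1_rows, df2_rows))
# [('key1', ['col_3', 'col_4'], ['c', 'd'], ['p', 'q'])]
```

On a real DataFrame, rows in this shape could be obtained with `[r.asDict() for r in df.collect()]`, which is only practical when the data fits in driver memory.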
