比较在一个分组内的所有行的 PySpark 数据框。

huangapple go评论68阅读模式
英文:

comparing all the rows with in a group pyspark dataframe

问题

Here is the translated code snippet:

我有以下数据帧我需要在`company`组内比较`first_nm``sur_nm`列的行值根据匹配情况我会在输出的`status`列中分配一个值

例如

如果在特定的`company`组中所有行的`first_nm``sur_nm`都匹配 - `status`为0

如果在`company`组中只有`first_nm`匹配 - `status`为1

如果在`company`组中只有`sur_nm`匹配 - `status`为2

如果没有匹配或有空值 - `status`为99

输出的数据帧如下

```python
+--------+--------+----------------+--------------+-------+
| company|      id|        first_nm|        sur_nm| status|
+--------+--------+----------------+--------------+-------+
|SYNTHE01|SYNTHE02|           JAMES|        FOWLER|      2|
|SYNTHE01|SYNTHE03|          MONICA|        FOWLER|      2|
|SYNTHE01|SYNTHE04|          GEORGE|        FOWLER|      2|
|SYNTHE08|SYNTHE05|           JAMES|        FIWLER|      1|
|SYNTHE08|SYNTHE06|           JAMES|        FUWLER|      1|
|SYNTHE08|SYNTHE07|           JAMES|        FAWLER|      1|
|SYNTHE08|SYNTHE08|           JAMES|        FEWLER|      1|
|SYNTHE11|SYNTHE12|           JAMES|        FOWLER|      0|
|SYNTHE11|SYNTHE11|           JAMES|        FOWLER|      0|
|SYNTHE09|SYNTHE0X|            Null|          Null|     99|
|SYNTHE09|SYNTHE0Y|            Null|          Null|     99|
|SYNTHE09|SYNTHE0Z|            Null|          Null|     99|
+--------+--------+----------------+--------------+-------+

如何处理不同行值的列内比较,请指导。

谢谢


<details>
<summary>英文:</summary>
I have dataframe as below, where I need to compare the row values of column `first_nm` and `sur_nm` within a group based on `company`. Based on the matching I would assign a value to `status` column in the output.

+--------+--------+----------------+--------------+
| company| id| first_nm| sur_nm|
+--------+--------+----------------+--------------+
|SYNTHE01|SYNTHE02| JAMES| FOWLER|
|SYNTHE01|SYNTHE03| MONICA| FOWLER|
|SYNTHE01|SYNTHE04| GEORGE| FOWLER|
|SYNTHE08|SYNTHE05| JAMES| FIWLER|
|SYNTHE08|SYNTHE06| JAMES| FUWLER|
|SYNTHE08|SYNTHE07| JAMES| FAWLER|
|SYNTHE08|SYNTHE08| JAMES| FEWLER|
|SYNTHE11|SYNTHE12| JAMES| FOWLER|
|SYNTHE11|SYNTHE11| JAMES| FOWLER|
|SYNTHE09|SYNTHE0X| Null| Null|
|SYNTHE09|SYNTHE0Y| Null| Null|
|SYNTHE09|SYNTHE0Z| Null| Null|
+--------+--------+----------------+--------------+


For eg.
If both `first_nm` and `sur_nm` of all rows matches in a particular `company` - `status` is 0.
If only `first_nm` matches in a `company` group - `status` is 1.
If only `sur_nm` matches in a `company` group - `status` is 2.
If nothing matches or null values - `status` is 99.
The output dataframe is as below:

+--------+--------+----------------+--------------+-------+
| company| id| first_nm| sur_nm| status|
+--------+--------+----------------+--------------+-------+
|SYNTHE01|SYNTHE02| JAMES| FOWLER| 2|
|SYNTHE01|SYNTHE03| MONICA| FOWLER| 2|
|SYNTHE01|SYNTHE04| GEORGE| FOWLER| 2|
|SYNTHE08|SYNTHE05| JAMES| FIWLER| 1|
|SYNTHE08|SYNTHE06| JAMES| FUWLER| 1|
|SYNTHE08|SYNTHE07| JAMES| FAWLER| 1|
|SYNTHE08|SYNTHE08| JAMES| FEWLER| 1|
|SYNTHE11|SYNTHE12| JAMES| FOWLER| 0|
|SYNTHE11|SYNTHE11| JAMES| FOWLER| 0|
|SYNTHE09|SYNTHE0X| Null| Null| 99|
|SYNTHE09|SYNTHE0Y| Null| Null| 99|
|SYNTHE09|SYNTHE0Z| Null| Null| 99|
+--------+--------+----------------+--------------+-------+


How can we handle this kind of compare within a column for different row values. Please guide.
Thank you
</details>
# 答案1
**得分**: 2
你的DataFrame(df):
```plaintext
+--------+--------+--------+------+
| 公司   |   id   | 名字    | 姓氏   |
+--------+--------+--------+------+
|SYNTHE01|SYNTHE02|   JAMES|FOWLER|
|SYNTHE01|SYNTHE03|  MONICA|FOWLER|
|SYNTHE01|SYNTHE04|  GEORGE|FOWLER|
|SYNTHE08|SYNTHE05|   JAMES|FIWLER|
|SYNTHE08|SYNTHE06|   JAMES|FUWLER|
|SYNTHE08|SYNTHE07|   JAMES|FAWLER|
|SYNTHE08|SYNTHE08|   JAMES|FEWLER|
|SYNTHE11|SYNTHE12|   JAMES|FOWLER|
|SYNTHE11|SYNTHE11|   JAMES|FOWLER|
|SYNTHE09|SYNTHE0X|    空  |  空  |
|SYNTHE09|SYNTHE0Y|    空  |  空  |
|SYNTHE09|SYNTHE0Z|    空  |  空  |
+--------+--------+--------+------+
  1. 导入必要的包:
from pyspark.sql.functions import col, when, size, collect_set
  1. 获取first_nmsur_nm的唯一计数:
unique_df = df.groupBy("公司").agg(
    size(collect_set("名字")).alias("名字计数"),
    size(collect_set("姓氏")).alias("姓氏计数")
)
  1. 应用条件:
company_status_df = unique_df.withColumn("状态",
                   when((col("名字计数") == 1) & (col("姓氏计数") == 1), 0)
                   .when(col("名字计数") == 1, 1)
                   .when(col("姓氏计数") == 1, 2)
                   .otherwise(99)
              ).select("公司", "状态")
  1. 与原始DataFrame df 进行连接:
df.join(company_status_df, "公司").show()

输出结果

+--------+--------+--------+------+------+  
| 公司   |   id   | 名字    | 姓氏   | 状态 |
+--------+--------+--------+------+------+  
|SYNTHE01|SYNTHE02|   JAMES|FOWLER|   2  |
|SYNTHE01|SYNTHE03|  MONICA|FOWLER|   2  |
|SYNTHE01|SYNTHE04|  GEORGE|FOWLER|   2  |
|SYNTHE08|SYNTHE05|   JAMES|FIWLER|   1  |
|SYNTHE08|SYNTHE06|   JAMES|FUWLER|   1  |
|SYNTHE08|SYNTHE07|   JAMES|FAWLER|   1  |
|SYNTHE08|SYNTHE08|   JAMES|FEWLER|   1  |
|SYNTHE11|SYNTHE12|   JAMES|FOWLER|   0  |
|SYNTHE11|SYNTHE11|   JAMES|FOWLER|   0  |
|SYNTHE09|SYNTHE0X|    空  |  空  |  99  |
|SYNTHE09|SYNTHE0Y|    空  |  空  |  99  |
|SYNTHE09|SYNTHE0Z|    空  |  空  |  99  |
+--------+--------+--------+------+------+  

请注意,我已将DataFrame中的中文内容翻译成了中文。如果有其他问题,请告诉我。

英文:

Your DataFrame (df):

+--------+--------+--------+------+
| company|      id|first_nm|sur_nm|
+--------+--------+--------+------+
|SYNTHE01|SYNTHE02|   JAMES|FOWLER|
|SYNTHE01|SYNTHE03|  MONICA|FOWLER|
|SYNTHE01|SYNTHE04|  GEORGE|FOWLER|
|SYNTHE08|SYNTHE05|   JAMES|FIWLER|
|SYNTHE08|SYNTHE06|   JAMES|FUWLER|
|SYNTHE08|SYNTHE07|   JAMES|FAWLER|
|SYNTHE08|SYNTHE08|   JAMES|FEWLER|
|SYNTHE11|SYNTHE12|   JAMES|FOWLER|
|SYNTHE11|SYNTHE11|   JAMES|FOWLER|
|SYNTHE09|SYNTHE0X|    null|  null|
|SYNTHE09|SYNTHE0Y|    null|  null|
|SYNTHE09|SYNTHE0Z|    null|  null|
+--------+--------+--------+------+
  1. Importing necessary packages
from pyspark.sql.functions import col, when, size, collect_set
  1. Get the unique count of first_nm and sur_nm
unique_df = df.groupBy(&quot;company&quot;).agg(
size(collect_set(&quot;first_nm&quot;)).alias(&quot;first_nm_size&quot;),
size(collect_set(&quot;sur_nm&quot;)).alias(&quot;sur_nm_size&quot;)
)
  1. Apply the condition
company_status_df = unique_df.withColumn(&quot;status&quot;, 
when((col(&quot;first_nm_size&quot;) == 1) &amp; (col(&quot;sur_nm_size&quot;) == 1), 0)
.when(col(&quot;first_nm_size&quot;) == 1, 1)
.when(col(&quot;sur_nm_size&quot;) == 1, 2)
.otherwise(99)
).select(&quot;company&quot;, &quot;status&quot;)
  1. Join it with the original DataFrame df
df.join(company_status_df, &quot;company&quot;).show()

Output:

+--------+--------+--------+------+------+
| company|      id|first_nm|sur_nm|status|
+--------+--------+--------+------+------+
|SYNTHE01|SYNTHE02|   JAMES|FOWLER|     2|
|SYNTHE01|SYNTHE03|  MONICA|FOWLER|     2|
|SYNTHE01|SYNTHE04|  GEORGE|FOWLER|     2|
|SYNTHE08|SYNTHE05|   JAMES|FIWLER|     1|
|SYNTHE08|SYNTHE06|   JAMES|FUWLER|     1|
|SYNTHE08|SYNTHE07|   JAMES|FAWLER|     1|
|SYNTHE08|SYNTHE08|   JAMES|FEWLER|     1|
|SYNTHE11|SYNTHE12|   JAMES|FOWLER|     0|
|SYNTHE11|SYNTHE11|   JAMES|FOWLER|     0|
|SYNTHE09|SYNTHE0X|    null|  null|    99|
|SYNTHE09|SYNTHE0Y|    null|  null|    99|
|SYNTHE09|SYNTHE0Z|    null|  null|    99|
+--------+--------+--------+------+------+

答案2

得分: 2

通过对分组的company列使用多个when条件(用于在组内计算count distinct值),可以实现如下代码:

import pyspark.sql.functions as F

df = df.join(df.groupby('company')
             .agg(F.when((F.countDistinct('first_nm') == 1) & (F.countDistinct('sur_nm') == 1), 0)
                   .when(F.countDistinct('first_nm') == 1, 1)
                   .when(F.countDistinct('sur_nm') == 1, 2).otherwise(99)
                   .alias('status')), on='company')

df.show(truncate=False)

这段代码将根据不同的条件计算status列的值,并将结果连接回原始的DataFrame。

英文:

Via multiple when conditions (to count distinct values in groups) on grouped company column:

import pyspark.sql.functions as F
df = df.join(df.groupby(&#39;company&#39;)
.agg(F.when((F.countDistinct(&#39;first_nm&#39;) == 1) &amp; (F.countDistinct(&#39;sur_nm&#39;) == 1), 0)
.when(F.countDistinct(&#39;first_nm&#39;) == 1, 1)
.when(F.countDistinct(&#39;sur_nm&#39;) == 1, 2).otherwise(99)
.alias(&#39;status&#39;)), on=&#39;company&#39;)
df.show(truncate=False)

+--------+--------+--------+------+------+
|company |id      |first_nm|sur_nm|status|
+--------+--------+--------+------+------+
|SYNTHE01|SYNTHE02|JAMES   |FOWLER|2     |
|SYNTHE01|SYNTHE03|MONICA  |FOWLER|2     |
|SYNTHE01|SYNTHE04|GEORGE  |FOWLER|2     |
|SYNTHE08|SYNTHE05|JAMES   |FIWLER|1     |
|SYNTHE08|SYNTHE06|JAMES   |FUWLER|1     |
|SYNTHE08|SYNTHE07|JAMES   |FAWLER|1     |
|SYNTHE08|SYNTHE08|JAMES   |FEWLER|1     |
|SYNTHE11|SYNTHE12|JAMES   |FOWLER|0     |
|SYNTHE11|SYNTHE11|JAMES   |FOWLER|0     |
|SYNTHE09|SYNTHE0X|null    |null  |99    |
|SYNTHE09|SYNTHE0Y|null    |null  |99    |
|SYNTHE09|SYNTHE0Z|null    |null  |99    |
+--------+--------+--------+------+------+

huangapple
  • 本文由 发表于 2023年5月6日 20:26:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/76188882.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定