Comparing all the rows within a group in a PySpark DataFrame

# Question
I have a dataframe as below, where I need to compare the row values of the columns `first_nm` and `sur_nm` within groups defined by `company`. Based on the matching, I would assign a value to a `status` column in the output.
```
+--------+--------+----------------+--------------+
| company|      id|        first_nm|        sur_nm|
+--------+--------+----------------+--------------+
|SYNTHE01|SYNTHE02|           JAMES|        FOWLER|
|SYNTHE01|SYNTHE03|          MONICA|        FOWLER|
|SYNTHE01|SYNTHE04|          GEORGE|        FOWLER|
|SYNTHE08|SYNTHE05|           JAMES|        FIWLER|
|SYNTHE08|SYNTHE06|           JAMES|        FUWLER|
|SYNTHE08|SYNTHE07|           JAMES|        FAWLER|
|SYNTHE08|SYNTHE08|           JAMES|        FEWLER|
|SYNTHE11|SYNTHE12|           JAMES|        FOWLER|
|SYNTHE11|SYNTHE11|           JAMES|        FOWLER|
|SYNTHE09|SYNTHE0X|            Null|          Null|
|SYNTHE09|SYNTHE0Y|            Null|          Null|
|SYNTHE09|SYNTHE0Z|            Null|          Null|
+--------+--------+----------------+--------------+
```
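(A minimal sketch to reproduce this frame, assuming an active `SparkSession` named `spark`; the data and column names are taken from the table above:)

```python
data = [
    ("SYNTHE01", "SYNTHE02", "JAMES", "FOWLER"),
    ("SYNTHE01", "SYNTHE03", "MONICA", "FOWLER"),
    ("SYNTHE01", "SYNTHE04", "GEORGE", "FOWLER"),
    ("SYNTHE08", "SYNTHE05", "JAMES", "FIWLER"),
    ("SYNTHE08", "SYNTHE06", "JAMES", "FUWLER"),
    ("SYNTHE08", "SYNTHE07", "JAMES", "FAWLER"),
    ("SYNTHE08", "SYNTHE08", "JAMES", "FEWLER"),
    ("SYNTHE11", "SYNTHE12", "JAMES", "FOWLER"),
    ("SYNTHE11", "SYNTHE11", "JAMES", "FOWLER"),
    ("SYNTHE09", "SYNTHE0X", None, None),
    ("SYNTHE09", "SYNTHE0Y", None, None),
    ("SYNTHE09", "SYNTHE0Z", None, None),
]
# Explicit DDL schema so the all-null name columns are still typed as strings
df = spark.createDataFrame(data, "company string, id string, first_nm string, sur_nm string")
```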
For example:

- If both `first_nm` and `sur_nm` match across all rows of a particular `company` group, `status` is 0.
- If only `first_nm` matches within a `company` group, `status` is 1.
- If only `sur_nm` matches within a `company` group, `status` is 2.
- If nothing matches, or the values are null, `status` is 99.
The output dataframe is as below:
```
+--------+--------+----------------+--------------+-------+
| company|      id|        first_nm|        sur_nm| status|
+--------+--------+----------------+--------------+-------+
|SYNTHE01|SYNTHE02|           JAMES|        FOWLER|      2|
|SYNTHE01|SYNTHE03|          MONICA|        FOWLER|      2|
|SYNTHE01|SYNTHE04|          GEORGE|        FOWLER|      2|
|SYNTHE08|SYNTHE05|           JAMES|        FIWLER|      1|
|SYNTHE08|SYNTHE06|           JAMES|        FUWLER|      1|
|SYNTHE08|SYNTHE07|           JAMES|        FAWLER|      1|
|SYNTHE08|SYNTHE08|           JAMES|        FEWLER|      1|
|SYNTHE11|SYNTHE12|           JAMES|        FOWLER|      0|
|SYNTHE11|SYNTHE11|           JAMES|        FOWLER|      0|
|SYNTHE09|SYNTHE0X|            Null|          Null|     99|
|SYNTHE09|SYNTHE0Y|            Null|          Null|     99|
|SYNTHE09|SYNTHE0Z|            Null|          Null|     99|
+--------+--------+----------------+--------------+-------+
```
How can I handle this kind of comparison within a column across different row values? Please guide.
Thank you
# Answer 1
**Score**: 2
Your DataFrame (df):
```
+--------+--------+--------+------+
| company|      id|first_nm|sur_nm|
+--------+--------+--------+------+
|SYNTHE01|SYNTHE02|   JAMES|FOWLER|
|SYNTHE01|SYNTHE03|  MONICA|FOWLER|
|SYNTHE01|SYNTHE04|  GEORGE|FOWLER|
|SYNTHE08|SYNTHE05|   JAMES|FIWLER|
|SYNTHE08|SYNTHE06|   JAMES|FUWLER|
|SYNTHE08|SYNTHE07|   JAMES|FAWLER|
|SYNTHE08|SYNTHE08|   JAMES|FEWLER|
|SYNTHE11|SYNTHE12|   JAMES|FOWLER|
|SYNTHE11|SYNTHE11|   JAMES|FOWLER|
|SYNTHE09|SYNTHE0X|    null|  null|
|SYNTHE09|SYNTHE0Y|    null|  null|
|SYNTHE09|SYNTHE0Z|    null|  null|
+--------+--------+--------+------+
```
- Import the necessary functions:

```python
from pyspark.sql.functions import col, when, size, collect_set
```
- Get the number of distinct values of `first_nm` and `sur_nm` within each `company` group:

```python
# collect_set ignores nulls, so an all-null group produces an empty set (size 0)
unique_df = df.groupBy("company").agg(
    size(collect_set("first_nm")).alias("first_nm_size"),
    size(collect_set("sur_nm")).alias("sur_nm_size")
)
```
- Apply the conditions:

```python
company_status_df = unique_df.withColumn("status",
    when((col("first_nm_size") == 1) & (col("sur_nm_size") == 1), 0)
    .when(col("first_nm_size") == 1, 1)
    .when(col("sur_nm_size") == 1, 2)
    .otherwise(99)  # no match, or an all-null group (set size 0)
).select("company", "status")
```
- Join it with the original DataFrame `df`:

```python
df.join(company_status_df, "company").show()
```
Output:
```
+--------+--------+--------+------+------+
| company|      id|first_nm|sur_nm|status|
+--------+--------+--------+------+------+
|SYNTHE01|SYNTHE02|   JAMES|FOWLER|     2|
|SYNTHE01|SYNTHE03|  MONICA|FOWLER|     2|
|SYNTHE01|SYNTHE04|  GEORGE|FOWLER|     2|
|SYNTHE08|SYNTHE05|   JAMES|FIWLER|     1|
|SYNTHE08|SYNTHE06|   JAMES|FUWLER|     1|
|SYNTHE08|SYNTHE07|   JAMES|FAWLER|     1|
|SYNTHE08|SYNTHE08|   JAMES|FEWLER|     1|
|SYNTHE11|SYNTHE12|   JAMES|FOWLER|     0|
|SYNTHE11|SYNTHE11|   JAMES|FOWLER|     0|
|SYNTHE09|SYNTHE0X|    null|  null|    99|
|SYNTHE09|SYNTHE0Y|    null|  null|    99|
|SYNTHE09|SYNTHE0Z|    null|  null|    99|
+--------+--------+--------+------+------+
```
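As a side note (not part of the original answer), the same per-group logic can be expressed with a window function so the explicit join is avoided. A minimal sketch under the same column names, again relying on the fact that `collect_set` ignores nulls:

```python
from pyspark.sql.functions import when, size, collect_set
from pyspark.sql.window import Window

# One window per company group; no ordering needed, since we aggregate
# over the whole partition.
w = Window.partitionBy("company")

first_cnt = size(collect_set("first_nm").over(w))  # distinct non-null first names
sur_cnt = size(collect_set("sur_nm").over(w))      # distinct non-null surnames

result = df.withColumn(
    "status",
    when((first_cnt == 1) & (sur_cnt == 1), 0)
    .when(first_cnt == 1, 1)
    .when(sur_cnt == 1, 2)
    .otherwise(99)  # no match, or an all-null group (set size 0)
)
result.show()
```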
# Answer 2
**Score**: 2
Via multiple `when` conditions (counting distinct values in each group) on the grouped `company` column:
```python
import pyspark.sql.functions as F

df = df.join(df.groupby('company')
               .agg(F.when((F.countDistinct('first_nm') == 1) &
                           (F.countDistinct('sur_nm') == 1), 0)
                     .when(F.countDistinct('first_nm') == 1, 1)
                     .when(F.countDistinct('sur_nm') == 1, 2)
                     .otherwise(99)
                     .alias('status')),
             on='company')
df.show(truncate=False)
```
```
+--------+--------+--------+------+------+
|company |id      |first_nm|sur_nm|status|
+--------+--------+--------+------+------+
|SYNTHE01|SYNTHE02|JAMES   |FOWLER|2     |
|SYNTHE01|SYNTHE03|MONICA  |FOWLER|2     |
|SYNTHE01|SYNTHE04|GEORGE  |FOWLER|2     |
|SYNTHE08|SYNTHE05|JAMES   |FIWLER|1     |
|SYNTHE08|SYNTHE06|JAMES   |FUWLER|1     |
|SYNTHE08|SYNTHE07|JAMES   |FAWLER|1     |
|SYNTHE08|SYNTHE08|JAMES   |FEWLER|1     |
|SYNTHE11|SYNTHE12|JAMES   |FOWLER|0     |
|SYNTHE11|SYNTHE11|JAMES   |FOWLER|0     |
|SYNTHE09|SYNTHE0X|null    |null  |99    |
|SYNTHE09|SYNTHE0Y|null    |null  |99    |
|SYNTHE09|SYNTHE0Z|null    |null  |99    |
+--------+--------+--------+------+------+
```
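One detail worth calling out (an editorial note, not part of the answer): `countDistinct`, like `collect_set` in the first answer, ignores nulls, which is what routes the all-null `SYNTHE09` group into `otherwise(99)`. A quick sketch to verify, assuming an active `SparkSession` named `spark`:

```python
import pyspark.sql.functions as F

# Explicit DDL schema so the all-null column is still typed as string
null_df = spark.createDataFrame([("C1", None), ("C1", None)],
                                "company string, first_nm string")

# countDistinct skips nulls: the group has 0 distinct values, so neither
# `== 1` branch matches and the status falls through to otherwise(99)
null_df.groupby("company").agg(
    F.countDistinct("first_nm").alias("n_distinct")).show()
```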