英文:
Deriving value of new column based on Group Pyspak
问题
以下是您要翻译的内容:
I have a use case where I want to derive the gender
of a person
by doing GroupBy
.
If the GroupBy
contains MALE
and NEUTRAL
title. We can consider ther person male
.
If the GroupBy
contains FEMALE
and NEUTRAL
title. We can consider ther person female
.
If the GroupBy
contains only NEUTRAL
title. We can consider ther person neutral
.
If the GroupBy
contains FEMALE
and MALE
title. We can consider ther person unknown
.
If the GroupBy
contains only MALE
title. We can consider the person male
. Similarly for FEMALE
, it would be female
.
MALE = ["Mr", "Lord"]
FEMALE = ["Ms", "Mrs", "Lady"]
NEUTRAL` = ["Professor", "Prof", "Dr"]
Input:
+--------+--------+
| person| title|
+--------+--------+
|SYNTHE02| Mr|
|SYNTHE02| Dr|
|SYNTHE03| Mr|
|SYNTHE03| Mr|
|SYNTHE05| Mrs|
|SYNTHE05| Ms|
|SYNTHE05| Ms|
|SYNTHE01| Mrs|
|SYNTHE01| Dr|
|SYNTHE01| Ms|
|SYNTHE07| Dr|
|SYNTHE07| Prof|
|SYNTHE08| Mrs|
|SYNTHE08| Prof|
|SYNTHE08| Mr|
+--------+--------+
Output:
+--------+--------+--------+
| person| title| gender|
+--------+--------+--------+
|SYNTHE02| Mr| Male|
|SYNTHE02| Dr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE05| Mrs| Female|
|SYNTHE05| Ms| Female|
|SYNTHE05| Ms| Female|
|SYNTHE01| Mrs| Female|
|SYNTHE01| Dr| Female|
|SYNTHE01| Ms| Female|
|SYNTHE07| Dr| Neutral|
|SYNTHE07| Prof| Neutral|
|SYNTHE08| Mrs| Unknown|
|SYNTHE08| Prof| Unknown|
|SYNTHE08| Mr| Unknown|
+--------+--------+--------+
英文:
I have a use case where I want to derive the gender
of a person
by doing GroupBy
.
If the GroupBy
contains MALE
and NEUTRAL
title. We can consider ther person male
.
If the GroupBy
contains FEMALE
and NEUTRAL
title. We can consider ther person female
.
If the GroupBy
contains only NEUTRAL
title. We can consider ther person neutral
.
If the GroupBy
contains FEMALE
and MALE
title. We can consider ther person unknown
.
If the GroupBy
contains only MALE
title. We can consider the person male
. Similarly for FEMALE
, it would be female
.
MALE = ["Mr", "Lord"]
FEMALE = ["Ms", "Mrs", "Lady"]
NEUTRAL` = ["Professor", "Prof", "Dr"]
Input:
+--------+--------+
| person| title|
+--------+--------+
|SYNTHE02| Mr|
|SYNTHE02| Dr|
|SYNTHE03| Mr|
|SYNTHE03| Mr|
|SYNTHE05| Mrs|
|SYNTHE05| Ms|
|SYNTHE05| Ms|
|SYNTHE01| Mrs|
|SYNTHE01| Dr|
|SYNTHE01| Ms|
|SYNTHE07| Dr|
|SYNTHE07| Prof|
|SYNTHE08| Mrs|
|SYNTHE08| Prof|
|SYNTHE08| Mr|
+--------+--------+
Output:
+--------+--------+--------+
| person| title| gender|
+--------+--------+--------+
|SYNTHE02| Mr| Male|
|SYNTHE02| Dr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE05| Mrs| Female|
|SYNTHE05| Ms| Female|
|SYNTHE05| Ms| Female|
|SYNTHE01| Mrs| Female|
|SYNTHE01| Dr| Female|
|SYNTHE01| Ms| Female|
|SYNTHE07| Dr| Neutral|
|SYNTHE07| Prof| Neutral|
|SYNTHE08| Mrs| Unknown|
|SYNTHE08| Prof| Unknown|
|SYNTHE08| Mr| Unknown|
+--------+--------+--------+
Any suggestion and help would be deeply appreciated. Thank you.
答案1
得分: 1
以下是代码的翻译部分:
这将起作用:
MALE = ["先生", "阁下"]
FEMALE = ["女士", "夫人", "女士"]
NEUTRAL = ["教授", "教授", "博士"]
df\
.groupBy("人物")\
.agg(F.collect_list("头衔").alias("头衔"))\
.withColumn("男性", F.array(*[F.lit(x) for x in MALE]))\
.withColumn("女性", F.array(*[F.lit(x) for x in FEMALE]))\
.withColumn("中性", F.array(*[F.lit(x) for x in NEUTRAL]))\
.withColumn("性别", F.when((F.arrays_overlap(F.col("头衔"),F.col("女性")) & F.arrays_overlap(F.col("头衔"),F.col("男性"))), "未知")
.when((F.arrays_overlap(F.col("头衔"),F.col("男性")) & F.arrays_overlap(F.col("头衔"),F.col("中性"))), "男性")
.when((F.arrays_overlap(F.col("头衔"),F.col("女性")) & F.arrays_overlap(F.col("头衔"),F.col("中性"))), "女性")
.when(F.arrays_overlap(F.col("头衔"),F.col("中性")), "中性"))\
.withColumn("性别", F.when((F.col("性别").isNull() & F.arrays_overlap(F.col("头衔"),F.col("女性"))), "女性")
.when((F.col("性别").isNull() & F.arrays_overlap(F.col("头衔"),F.col("男性"))), "男性")
.otherwise(F.col("性别")))\
.selectExpr("人物","explode(头衔) as 头衔","性别")\
.show()
输入和输出部分保持不变。
英文:
This would work:
MALE = ["Mr", "Lord"]
FEMALE = ["Ms", "Mrs", "Lady"]
NEUTRAL = ["Professor", "Prof", "Dr"]
df\
.groupBy("Person")\
.agg(F.collect_list("Title").alias("Titles"))\
.withColumn("MALE", F.array(*[F.lit(x) for x in MALE]))\
.withColumn("FEMALE", F.array(*[F.lit(x) for x in FEMALE]))\
.withColumn("NEUTRAL", F.array(*[F.lit(x) for x in NEUTRAL]))\
.withColumn("Gender", F.when((F.arrays_overlap(F.col("Titles"),F.col("FEMALE")) & F.arrays_overlap(F.col("Titles"),F.col("MALE"))), "Unknown")
.when((F.arrays_overlap(F.col("Titles"),F.col("MALE")) & F.arrays_overlap(F.col("Titles"),F.col("NEUTRAL"))), "Male")
.when((F.arrays_overlap(F.col("Titles"),F.col("FEMALE")) & F.arrays_overlap(F.col("Titles"),F.col("NEUTRAL"))), "Female")
.when(F.arrays_overlap(F.col("Titles"),F.col("NEUTRAL")), "Neutral"))\
.withColumn("Gender", F.when((F.col("Gender").isNull() & F.arrays_overlap(F.col("Titles"),F.col("FEMALE"))), "Female")
.when((F.col("Gender").isNull() & F.arrays_overlap(F.col("Titles"),F.col("MALE"))), "Male")
.otherwise(F.col("Gender")))\
.selectExpr("Person","explode(Titles) as Title","Gender")\
.show()
Input:
+--------+-----+
| Person|Title|
+--------+-----+
|SYNTHE02| Mr|
|SYNTHE02| Dr|
|SYNTHE03| Mr|
|SYNTHE03| Mr|
|SYNTHE05| Mrs|
|SYNTHE05| Ms|
|SYNTHE05| Ms|
|SYNTHE01| Mrs|
|SYNTHE01| Dr|
|SYNTHE01| Ms|
|SYNTHE07| Dr|
|SYNTHE07| Prof|
|SYNTHE08| Mrs|
|SYNTHE08| Prof|
|SYNTHE08| Mr|
+--------+-----+
Output:
+--------+-----+-------+
| Person|Title| Gender|
+--------+-----+-------+
|SYNTHE02| Mr| Male|
|SYNTHE02| Dr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE05| Mrs| Female|
|SYNTHE05| Ms| Female|
|SYNTHE05| Ms| Female|
|SYNTHE01| Mrs| Female|
|SYNTHE01| Dr| Female|
|SYNTHE01| Ms| Female|
|SYNTHE07| Dr|Neutral|
|SYNTHE07| Prof|Neutral|
|SYNTHE08| Mrs|Unknown|
|SYNTHE08| Prof|Unknown|
|SYNTHE08| Mr|Unknown|
+--------+-----+-------+
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论