Deriving the value of a new column based on a group in PySpark
Question
I have a use case where I want to derive the gender of a person by grouping that person's rows (GroupBy on person) and looking at the titles in the group.
If the group contains MALE and NEUTRAL titles, we can consider the person male.
If the group contains FEMALE and NEUTRAL titles, we can consider the person female.
If the group contains only NEUTRAL titles, we can consider the person neutral.
If the group contains both FEMALE and MALE titles, we can consider the person unknown.
If the group contains only MALE titles, we can consider the person male. Similarly, only FEMALE titles means female.
MALE = ["Mr", "Lord"]
FEMALE = ["Ms", "Mrs", "Lady"]
NEUTRAL = ["Professor", "Prof", "Dr"]
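The rules above can be sketched in plain Python before moving to Spark, which makes the intended decision order easier to check. This is only an illustration under stated assumptions: the helper name `derive_gender` is hypothetical, and returning `None` for a group with no recognised title is an assumption, since the question does not cover that case.

```python
MALE = {"Mr", "Lord"}
FEMALE = {"Ms", "Mrs", "Lady"}
NEUTRAL = {"Professor", "Prof", "Dr"}


def derive_gender(titles):
    """Apply the question's rules to all titles of one person (hypothetical helper)."""
    titles = set(titles)
    has_male = bool(titles & MALE)
    has_female = bool(titles & FEMALE)
    has_neutral = bool(titles & NEUTRAL)

    if has_male and has_female:
        return "Unknown"   # conflicting titles, with or without neutral ones
    if has_male:
        return "Male"      # male only, or male plus neutral
    if has_female:
        return "Female"    # female only, or female plus neutral
    if has_neutral:
        return "Neutral"   # nothing but neutral titles
    return None            # assumption: no recognised title at all


rows = [
    ("SYNTHE02", "Mr"), ("SYNTHE02", "Dr"),
    ("SYNTHE07", "Dr"), ("SYNTHE07", "Prof"),
    ("SYNTHE08", "Mrs"), ("SYNTHE08", "Prof"), ("SYNTHE08", "Mr"),
]

# Group titles by person, derive one gender per group,
# then attach it back to every original row.
groups = {}
for person, title in rows:
    groups.setdefault(person, []).append(title)

gender_by_person = {p: derive_gender(t) for p, t in groups.items()}
result = [(p, t, gender_by_person[p]) for p, t in rows]
```

Checking the conflicting case first means the remaining branches do not need to test for neutral titles explicitly.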
Input:
+--------+--------+
| person| title|
+--------+--------+
|SYNTHE02| Mr|
|SYNTHE02| Dr|
|SYNTHE03| Mr|
|SYNTHE03| Mr|
|SYNTHE05| Mrs|
|SYNTHE05| Ms|
|SYNTHE05| Ms|
|SYNTHE01| Mrs|
|SYNTHE01| Dr|
|SYNTHE01| Ms|
|SYNTHE07| Dr|
|SYNTHE07| Prof|
|SYNTHE08| Mrs|
|SYNTHE08| Prof|
|SYNTHE08| Mr|
+--------+--------+
Output:
+--------+--------+--------+
| person| title| gender|
+--------+--------+--------+
|SYNTHE02| Mr| Male|
|SYNTHE02| Dr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE05| Mrs| Female|
|SYNTHE05| Ms| Female|
|SYNTHE05| Ms| Female|
|SYNTHE01| Mrs| Female|
|SYNTHE01| Dr| Female|
|SYNTHE01| Ms| Female|
|SYNTHE07| Dr| Neutral|
|SYNTHE07| Prof| Neutral|
|SYNTHE08| Mrs| Unknown|
|SYNTHE08| Prof| Unknown|
|SYNTHE08| Mr| Unknown|
+--------+--------+--------+
Any suggestion and help would be deeply appreciated. Thank you.
Answer 1
Score: 1
This would work:
from pyspark.sql import functions as F

MALE = ["Mr", "Lord"]
FEMALE = ["Ms", "Mrs", "Lady"]
NEUTRAL = ["Professor", "Prof", "Dr"]

# df is assumed to have the columns "Person" and "Title".
(
    df
    .groupBy("Person")
    # Collect every title of a person into one array column.
    .agg(F.collect_list("Title").alias("Titles"))
    # Turn each Python list into a literal array column for arrays_overlap.
    .withColumn("MALE", F.array(*[F.lit(x) for x in MALE]))
    .withColumn("FEMALE", F.array(*[F.lit(x) for x in FEMALE]))
    .withColumn("NEUTRAL", F.array(*[F.lit(x) for x in NEUTRAL]))
    # Order matters: the conflicting case is tested first, so the later
    # branches cover "male only" and "male plus neutral" alike.
    .withColumn(
        "Gender",
        F.when(F.arrays_overlap(F.col("Titles"), F.col("FEMALE"))
               & F.arrays_overlap(F.col("Titles"), F.col("MALE")), "Unknown")
        .when(F.arrays_overlap(F.col("Titles"), F.col("MALE")), "Male")
        .when(F.arrays_overlap(F.col("Titles"), F.col("FEMALE")), "Female")
        .when(F.arrays_overlap(F.col("Titles"), F.col("NEUTRAL")), "Neutral"),
    )
    # Restore one row per original title, each carrying the group's gender.
    .selectExpr("Person", "explode(Titles) as Title", "Gender")
    .show()
)
Input:
+--------+-----+
| Person|Title|
+--------+-----+
|SYNTHE02| Mr|
|SYNTHE02| Dr|
|SYNTHE03| Mr|
|SYNTHE03| Mr|
|SYNTHE05| Mrs|
|SYNTHE05| Ms|
|SYNTHE05| Ms|
|SYNTHE01| Mrs|
|SYNTHE01| Dr|
|SYNTHE01| Ms|
|SYNTHE07| Dr|
|SYNTHE07| Prof|
|SYNTHE08| Mrs|
|SYNTHE08| Prof|
|SYNTHE08| Mr|
+--------+-----+
Output:
+--------+-----+-------+
| Person|Title| Gender|
+--------+-----+-------+
|SYNTHE02| Mr| Male|
|SYNTHE02| Dr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE03| Mr| Male|
|SYNTHE05| Mrs| Female|
|SYNTHE05| Ms| Female|
|SYNTHE05| Ms| Female|
|SYNTHE01| Mrs| Female|
|SYNTHE01| Dr| Female|
|SYNTHE01| Ms| Female|
|SYNTHE07| Dr|Neutral|
|SYNTHE07| Prof|Neutral|
|SYNTHE08| Mrs|Unknown|
|SYNTHE08| Prof|Unknown|
|SYNTHE08| Mr|Unknown|
+--------+-----+-------+