根据 Group Pyspak 推导新列的值

huangapple go评论74阅读模式
英文:

Deriving value of new column based on Group Pyspak

问题

以下是您要翻译的内容:

I have a use case where I want to derive the gender of a person by doing GroupBy.

If the GroupBy contains MALE and NEUTRAL title. We can consider ther person male.

If the GroupBy contains FEMALE and NEUTRAL title. We can consider ther person female.

If the GroupBy contains only NEUTRAL title. We can consider ther person neutral.

If the GroupBy contains FEMALE and MALE title. We can consider ther person unknown.

If the GroupBy contains only MALE title. We can consider the person male. Similarly for FEMALE, it would be female.

MALE = ["Mr", "Lord"]

FEMALE = ["Ms", "Mrs", "Lady"]

NEUTRAL` = ["Professor", "Prof", "Dr"]

Input:

+--------+--------+
|  person|   title|
+--------+--------+
|SYNTHE02|      Mr|
|SYNTHE02|      Dr|
|SYNTHE03|      Mr|
|SYNTHE03|      Mr|
|SYNTHE05|     Mrs|
|SYNTHE05|      Ms|
|SYNTHE05|      Ms|
|SYNTHE01|     Mrs|
|SYNTHE01|      Dr|
|SYNTHE01|      Ms|
|SYNTHE07|      Dr|
|SYNTHE07|    Prof|
|SYNTHE08|     Mrs|
|SYNTHE08|    Prof|
|SYNTHE08|      Mr|
+--------+--------+

Output:

+--------+--------+--------+
|  person|   title|  gender|
+--------+--------+--------+
|SYNTHE02|      Mr|    Male|
|SYNTHE02|      Dr|    Male|
|SYNTHE03|      Mr|    Male|
|SYNTHE03|      Mr|    Male|
|SYNTHE05|     Mrs|  Female|
|SYNTHE05|      Ms|  Female|
|SYNTHE05|      Ms|  Female|
|SYNTHE01|     Mrs|  Female|
|SYNTHE01|      Dr|  Female|
|SYNTHE01|      Ms|  Female|
|SYNTHE07|      Dr| Neutral|
|SYNTHE07|    Prof| Neutral|
|SYNTHE08|     Mrs| Unknown|
|SYNTHE08|    Prof| Unknown|
|SYNTHE08|      Mr| Unknown|
+--------+--------+--------+
英文:

I have a use case where I want to derive the gender of a person by doing GroupBy.

If the GroupBy contains MALE and NEUTRAL title. We can consider ther person male.

If the GroupBy contains FEMALE and NEUTRAL title. We can consider ther person female.

If the GroupBy contains only NEUTRAL title. We can consider ther person neutral.

If the GroupBy contains FEMALE and MALE title. We can consider ther person unknown.

If the GroupBy contains only MALE title. We can consider the person male. Similarly for FEMALE, it would be female.

MALE = ["Mr", "Lord"]

FEMALE = ["Ms", "Mrs", "Lady"]

NEUTRAL` = ["Professor", "Prof", "Dr"]

Input:

+--------+--------+
|  person|   title|
+--------+--------+
|SYNTHE02|      Mr|
|SYNTHE02|      Dr|
|SYNTHE03|      Mr|
|SYNTHE03|      Mr|
|SYNTHE05|     Mrs|
|SYNTHE05|      Ms|
|SYNTHE05|      Ms|
|SYNTHE01|     Mrs|
|SYNTHE01|      Dr|
|SYNTHE01|      Ms|
|SYNTHE07|      Dr|
|SYNTHE07|    Prof|
|SYNTHE08|     Mrs|
|SYNTHE08|    Prof|
|SYNTHE08|      Mr|
+--------+--------+

Output:

+--------+--------+--------+
|  person|   title|  gender|
+--------+--------+--------+
|SYNTHE02|      Mr|    Male|
|SYNTHE02|      Dr|    Male|
|SYNTHE03|      Mr|    Male|
|SYNTHE03|      Mr|    Male|
|SYNTHE05|     Mrs|  Female|
|SYNTHE05|      Ms|  Female|
|SYNTHE05|      Ms|  Female|
|SYNTHE01|     Mrs|  Female|
|SYNTHE01|      Dr|  Female|
|SYNTHE01|      Ms|  Female|
|SYNTHE07|      Dr| Neutral|
|SYNTHE07|    Prof| Neutral|
|SYNTHE08|     Mrs| Unknown|
|SYNTHE08|    Prof| Unknown|
|SYNTHE08|      Mr| Unknown|
+--------+--------+--------+

Any suggestion and help would be deeply appreciated. Thank you.

答案1

得分: 1

以下是代码的翻译部分:

这将起作用

    MALE = ["先生", "阁下"]
    FEMALE = ["女士", "夫人", "女士"]
    NEUTRAL = ["教授", "教授", "博士"]
    
    df\
    .groupBy("人物")\
    .agg(F.collect_list("头衔").alias("头衔"))\
    .withColumn("男性", F.array(*[F.lit(x) for x in MALE]))\
    .withColumn("女性", F.array(*[F.lit(x) for x in FEMALE]))\
    .withColumn("中性", F.array(*[F.lit(x) for x in NEUTRAL]))\
    .withColumn("性别", F.when((F.arrays_overlap(F.col("头衔"),F.col("女性")) & F.arrays_overlap(F.col("头衔"),F.col("男性"))), "未知")
                 .when((F.arrays_overlap(F.col("头衔"),F.col("男性")) & F.arrays_overlap(F.col("头衔"),F.col("中性"))), "男性")
                .when((F.arrays_overlap(F.col("头衔"),F.col("女性")) & F.arrays_overlap(F.col("头衔"),F.col("中性"))), "女性")
                .when(F.arrays_overlap(F.col("头衔"),F.col("中性")), "中性"))\
    .withColumn("性别", F.when((F.col("性别").isNull() & F.arrays_overlap(F.col("头衔"),F.col("女性"))), "女性")
                 .when((F.col("性别").isNull() & F.arrays_overlap(F.col("头衔"),F.col("男性"))), "男性")
                .otherwise(F.col("性别")))\
    .selectExpr("人物","explode(头衔) as 头衔","性别")\
    .show()

输入和输出部分保持不变。

英文:

This would work:

MALE = ["Mr", "Lord"]
FEMALE = ["Ms", "Mrs", "Lady"]
NEUTRAL = ["Professor", "Prof", "Dr"]

df\
.groupBy("Person")\
.agg(F.collect_list("Title").alias("Titles"))\
.withColumn("MALE", F.array(*[F.lit(x) for x in MALE]))\
.withColumn("FEMALE", F.array(*[F.lit(x) for x in FEMALE]))\
.withColumn("NEUTRAL", F.array(*[F.lit(x) for x in NEUTRAL]))\
.withColumn("Gender", F.when((F.arrays_overlap(F.col("Titles"),F.col("FEMALE")) & F.arrays_overlap(F.col("Titles"),F.col("MALE"))), "Unknown")
             .when((F.arrays_overlap(F.col("Titles"),F.col("MALE")) & F.arrays_overlap(F.col("Titles"),F.col("NEUTRAL"))), "Male")
            .when((F.arrays_overlap(F.col("Titles"),F.col("FEMALE")) & F.arrays_overlap(F.col("Titles"),F.col("NEUTRAL"))), "Female")
            .when(F.arrays_overlap(F.col("Titles"),F.col("NEUTRAL")), "Neutral"))\
.withColumn("Gender", F.when((F.col("Gender").isNull() & F.arrays_overlap(F.col("Titles"),F.col("FEMALE"))), "Female")
             .when((F.col("Gender").isNull() & F.arrays_overlap(F.col("Titles"),F.col("MALE"))), "Male")
            .otherwise(F.col("Gender")))\
.selectExpr("Person","explode(Titles) as Title","Gender")\
.show()

Input:

+--------+-----+
|  Person|Title|
+--------+-----+
|SYNTHE02|   Mr|
|SYNTHE02|   Dr|
|SYNTHE03|   Mr|
|SYNTHE03|   Mr|
|SYNTHE05|  Mrs|
|SYNTHE05|   Ms|
|SYNTHE05|   Ms|
|SYNTHE01|  Mrs|
|SYNTHE01|   Dr|
|SYNTHE01|   Ms|
|SYNTHE07|   Dr|
|SYNTHE07| Prof|
|SYNTHE08|  Mrs|
|SYNTHE08| Prof|
|SYNTHE08|   Mr|
+--------+-----+

Output:

+--------+-----+-------+
|  Person|Title| Gender|
+--------+-----+-------+
|SYNTHE02|   Mr|   Male|
|SYNTHE02|   Dr|   Male|
|SYNTHE03|   Mr|   Male|
|SYNTHE03|   Mr|   Male|
|SYNTHE05|  Mrs| Female|
|SYNTHE05|   Ms| Female|
|SYNTHE05|   Ms| Female|
|SYNTHE01|  Mrs| Female|
|SYNTHE01|   Dr| Female|
|SYNTHE01|   Ms| Female|
|SYNTHE07|   Dr|Neutral|
|SYNTHE07| Prof|Neutral|
|SYNTHE08|  Mrs|Unknown|
|SYNTHE08| Prof|Unknown|
|SYNTHE08|   Mr|Unknown|
+--------+-----+-------+

huangapple
  • 本文由 发表于 2023年5月10日 21:17:26
  • 转载请务必保留本文链接:https://go.coder-hub.com/76218924.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定