How to create new columns based on values in two specific columns of a Spark DataFrame?

Question

I have a DataFrame:

client  type
------  ----
89      id
56      id
34      id
13      id
67      phone
68      phone

I need to create two new columns based on the "client" and "type" columns. Where "type" == "id", the client number should go into column "id"; where "type" == "phone", the client number should go into column "phone".

I tried:

Df.withColumn("id", when($"type" === "id", $"client")).withColumn("phone", when($"type" === "phone", $"client"))

and I get this result:

+------+----+---+-----+
|client|type| id|phone|
+------+----+---+-----+
|    89|cuid| 89| null|
|    56|cuid| 56| null|
|    34|cuid| 34| null|
|    13|cuid| 13| null|
+------+----+---+-----+

but the expected result is:

+------+----+----+-----+
|client|type|  id|phone|
+------+----+----+-----+
|    89|cuid|  89| null|
|    56|cuid|  56| null|
|    34|cuid|  34| null|
|    13|cuid|  13| null|
|    67|cuid|null|   67|
|    68|cuid|null|   68|
+------+----+----+-----+

Answer 1

Score: -1

You can try something like this:

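import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session (the PySpark shell and notebooks already define `spark`)
spark = SparkSession.builder.getOrCreate()

x = [(89, "id"), (56, "id"), (34, "id"), (13, "id"), (67, "phone"), (68, "phone")]

df = (
    spark.createDataFrame(x, schema=["client", "type"])
    # when() without otherwise() yields null for rows that do not match
    .withColumn("id", F.when(F.col("type") == F.lit("id"), F.col("client")))
    .withColumn("phone", F.when(F.col("type") == F.lit("phone"), F.col("client")))
)

# show() returns None, so call it separately to keep `df` usable afterwards
df.show()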

Output:

+------+-----+----+-----+
|client| type|  id|phone|
+------+-----+----+-----+
|    89|   id|  89| null|
|    56|   id|  56| null|
|    34|   id|  34| null|
|    13|   id|  13| null|
|    67|phone|null|   67|
|    68|phone|null|   68|
+------+-----+----+-----+
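
Note that this is the same logic as the Scala attempt in the question: withColumn only adds a column and never drops rows, so the missing "phone" rows there most likely come from the input DataFrame rather than from the when calls. To make the per-row behaviour concrete, here is a minimal plain-Python model of it (no Spark needed). The helper name add_conditional_columns is made up for this illustration and is not part of any Spark API.

```python
from typing import Any, Dict, List


def add_conditional_columns(
    rows: List[Dict[str, Any]], type_values: List[str]
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for chaining
    withColumn(v, when(col("type") == v, col("client"))) for each v."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for value in type_values:
            # Mirrors when(...) without otherwise(...):
            # a non-matching row gets None (Spark displays it as null)
            new_row[value] = row["client"] if row["type"] == value else None
        result.append(new_row)
    return result


data = [(89, "id"), (56, "id"), (34, "id"), (13, "id"), (67, "phone"), (68, "phone")]
rows = [{"client": client, "type": kind} for client, kind in data]

out = add_conditional_columns(rows, ["id", "phone"])
for row in out:
    print(row)
```

Every input row appears in the result, with None in whichever column does not match its type. If rows are missing in the real job, it may be worth checking the distinct values of "type" in the source DataFrame first.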

Posted by huangapple on May 30, 2023, 03:03:50
When reposting, please keep the link to this article: https://go.coder-hub.com/76359801.html