How to create new columns based on values in two specific columns of a Spark DataFrame?
Question
I have a DataFrame:
client  type
------  -----
89      id
56      id
34      id
13      id
67      phone
68      phone
I need to create two new columns based on the columns "client" and "type": where "type" == "id", put the client number in a new "id" column, and where "type" == "phone", put it in a new "phone" column.
I tried:
Df.withColumn("id", when($"type" === "id", $"client")).withColumn("phone", when($"type" === "phone", $"client"))
and I get this result:
+------+----+----+-----+
|client|type|  id|phone|
+------+----+----+-----+
|    89|cuid|  89| null|
|    56|cuid|  56| null|
|    34|cuid|  34| null|
|    13|cuid|  13| null|
+------+----+----+-----+
but the expected result is:
+------+----+----+-----+
|client|type|  id|phone|
+------+----+----+-----+
|    89|cuid|  89| null|
|    56|cuid|  56| null|
|    34|cuid|  34| null|
|    13|cuid|  13| null|
|    67|cuid|null|   67|
|    68|cuid|null|   68|
+------+----+----+-----+
Answer 1
Score: -1
You can try something like this:
import pyspark.sql.functions as F

x = [(89, "id"), (56, "id"), (34, "id"), (13, "id"), (67, "phone"), (68, "phone")]
df = (
    spark.createDataFrame(x, schema=["client", "type"])
    .withColumn("id", F.when(F.col("type") == F.lit("id"), F.col("client")))
    .withColumn("phone", F.when(F.col("type") == F.lit("phone"), F.col("client")))
)
df.show()
output:
+------+-----+----+-----+
|client| type| id|phone|
+------+-----+----+-----+
| 89| id| 89| null|
| 56| id| 56| null|
| 34| id| 34| null|
| 13| id| 13| null|
| 67|phone|null| 67|
| 68|phone|null| 68|
+------+-----+----+-----+
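Note that `F.when` without an `.otherwise(...)` clause leaves non-matching rows as null, which is why each row ends up with a value in exactly one of the two new columns. The same conditional split can be sketched in plain Python (no Spark required) to illustrate the logic; the names here are illustrative only:

```python
# Plain-Python sketch of the same conditional split: for each
# (client, type) row, route the client number into the "id" or
# "phone" slot, leaving the other as None (Spark shows it as null).
rows = [(89, "id"), (56, "id"), (34, "id"), (13, "id"), (67, "phone"), (68, "phone")]

result = [
    {
        "client": client,
        "type": typ,
        "id": client if typ == "id" else None,        # mirrors F.when(type == "id", client)
        "phone": client if typ == "phone" else None,  # mirrors F.when(type == "phone", client)
    }
    for client, typ in rows
]

for row in result:
    print(row)
```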
Comments