Create new Column based on the data of existing columns

Question
I understand your requirement. You want to transform the input dataframe into the output format, where errorColumn should contain the names of the columns that are not empty, joined with - as a separator whenever two or more fields have values. You tried concat, but the performance was poor.

You can try the when and concat_ws functions from pyspark.sql.functions to achieve this. Here is some example code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, concat_ws

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Your input dataframe
input_df = spark.createDataFrame([
    (1, "XXXXXXXX", None, None, None),
    (1, "XXXXXXXX", None, None, None),
    (2, "YYYYYYYY", "CAROLINE", "ELIZABETH", None),
    (3, "YYYYYYYZ", None, None, "GALE"),
    (4, "YYYYYYYM", "CAROLINE", None, None),
    (5, "YYYYYYYL", None, None, None)
], ["sequence", "registerNumber", "first_name", "middle_name", "surname"])

# Build errorColumn with when and concat_ws; a when() whose condition is
# false yields null, and concat_ws silently skips null arguments
output_df = input_df.withColumn("errorColumn",
    concat_ws("-",
        when(input_df["first_name"].isNotNull(), "first_name"),
        when(input_df["middle_name"].isNotNull(), "middle_name"),
        when(input_df["surname"].isNotNull(), "surname")
    )
)

# Show the result
output_df.show()
This code creates a new dataframe output_df in which errorColumn contains the desired column names, separated by -, including each name only when the corresponding column is non-null. Hopefully this helps improve the performance.
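One caveat, stated as an assumption about the data: isNotNull only treats true nulls as missing, so if the columns actually hold empty strings (as the blank cells in the printed tables suggest), those names would still be concatenated. The row-wise rule the expression is meant to implement can be sketched in plain Python (the error_column helper below is illustrative only, not part of any Spark API):

```python
def error_column(row, cols, sep="-"):
    # Illustrative helper (not a Spark API): keep the *name* of each
    # column whose value is neither None nor an empty string, then join
    # the surviving names -- mirroring how concat_ws skips the nulls
    # that when() produces for rows where the condition is false.
    names = [c for c in cols if row.get(c) not in (None, "")]
    return sep.join(names)

cols = ["first_name", "middle_name", "surname"]
print(error_column({"first_name": "CAROLINE", "middle_name": "ELIZABETH", "surname": None}, cols))
# -> first_name-middle_name
print(error_column({"first_name": None, "middle_name": None, "surname": "GALE"}, cols))
# -> surname
```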
Original question:
I have an input dataframe with these columns as below:
+--------+--------------+----------+-----------+-------+
|sequence|registerNumber|first_name|middle_name|surname|
+--------+--------------+----------+-----------+-------+
|       1|      XXXXXXXX|          |           |       |
|       1|      XXXXXXXX|          |           |       |
|       2|      YYYYYYYY|  CAROLINE|  ELIZABETH|       |
|       3|      YYYYYYYZ|          |           |   GALE|
|       4|      YYYYYYYM|  CAROLINE|           |       |
|       5|      YYYYYYYL|          |           |       |
+--------+--------------+----------+-----------+-------+
I want to have an output dataframe like this:
+--------+--------------+----------+-----------+-------+----------------------+
|sequence|registerNumber|first_name|middle_name|surname|           errorColumn|
+--------+--------------+----------+-----------+-------+----------------------+
|       1|      XXXXXXXX|          |           |       |                      |
|       1|      XXXXXXXX|          |           |       |                      |
|       2|      YYYYYYYY|  CAROLINE|  ELIZABETH|       |first_name-middle_name|
|       3|      YYYYYYYZ|          |           |   GALE|               surname|
|       4|      YYYYYYYM|  CAROLINE|           |       |            first_name|
|       5|      YYYYYYYL|          |           |       |                      |
+--------+--------------+----------+-----------+-------+----------------------+
The errorColumn should contain the column names (first_name, middle_name, surname) which aren't empty, with - as a separator whenever there's a value in 2 or more fields.

I am trying to do this for a list of columns and tried to do it using concat, but the performance is poor.
Answer 1

Score: 2
Using concat_ws and when, collecting a column name only when its character length is greater than 0 (not an empty string). Note that F.length of a null is itself null, so this condition skips both empty strings and nulls.

from pyspark.sql import functions as F

cols = ['first_name', 'middle_name', 'surname']

df = df.withColumn(
    "error_column",
    F.concat_ws('-', *[F.when(F.length(F.col(x)) > 0, F.lit(x)) for x in cols])
)
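The reason no otherwise() branch is needed is that concat_ws, unlike concat, skips null arguments rather than nulling the whole result. That difference can be mimicked in plain Python (concat_like and concat_ws_like below are hypothetical stand-ins for illustration, not Spark functions):

```python
def concat_like(*args):
    # Like Spark's concat: any null argument nulls the entire result.
    if any(a is None for a in args):
        return None
    return "".join(args)

def concat_ws_like(sep, *args):
    # Like Spark's concat_ws: null arguments are simply skipped.
    return sep.join(a for a in args if a is not None)

print(concat_ws_like("-", "first_name", None, "surname"))  # -> first_name-surname
print(concat_like("first_name", None, "surname"))          # -> None
```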
Comments