Create new Column based on the data of existing columns

Question

I have an input dataframe with these columns as below:

+--------+--------------+----------+-----------+-------+
|sequence|registerNumber|first_name|middle_name|surname|
+--------+--------------+----------+-----------+-------+
|       1|      XXXXXXXX|          |           |       |
|       1|      XXXXXXXX|          |           |       |
|       2|      YYYYYYYY| CAR弟LINE| ELIZABET弟|       |
|       3|      YYYYYYYZ|          |           |  GAL弟|
|       4|      YYYYYYYM| 弟AROLINE|           |       |
|       5|      YYYYYYYL|          |           |       |
+--------+--------------+----------+-----------+-------+

I want to have an output dataframe like this:

+--------+--------------+----------+-----------+-------+----------------------+
|sequence|registerNumber|first_name|middle_name|surname|errorColumn           |
+--------+--------------+----------+-----------+-------+----------------------+
|       1|      XXXXXXXX|          |           |       |                      |
|       1|      XXXXXXXX|          |           |       |                      |
|       2|      YYYYYYYY| CAR弟LINE| ELIZABET弟|       |first_name-middle_name|
|       3|      YYYYYYYZ|          |           |  GAL弟|surname               |
|       4|      YYYYYYYM| 弟AROLINE|           |       |first_name            |
|       5|      YYYYYYYL|          |           |       |                      |
+--------+--------------+----------+-----------+-------+----------------------+

The errorColumn should contain the names of the columns (first_name, middle_name, surname) that aren't empty, joined with - as the separator whenever two or more fields have values.

I am trying to do this for a list of columns and tried using concat, but the performance is poor.

Answer 1

Score: 2


Using concat_ws and when, emitting a column name only when its character length is greater than 0 (i.e., the value is not an empty string):

from pyspark.sql import functions as F

cols = ['first_name', 'middle_name', 'surname']

# For each column, when() yields the column name if the value is non-empty
# and null otherwise; concat_ws('-') joins the names and skips the nulls.
df = (df.withColumn("error_column",
                    F.concat_ws('-',
                                *[F.when(F.length(F.col(x)) > 0, F.lit(x))
                                  for x in cols])))
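
For reference, a minimal self-contained sketch of how this snippet behaves, using made-up placeholder rows with the question's column layout (the names, the app name, and the assumption that the blank cells are empty strings are hypothetical; F.length returns null for null input, so the when condition also fails for nulls and concat_ws drops the entry either way):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("error-column-demo").getOrCreate()

# Placeholder rows (hypothetical values), blanks represented as empty strings
df = spark.createDataFrame([
    (2, "YYYYYYYY", "ANNA", "MARIE", ""),
    (3, "YYYYYYYZ", "", "", "SMITH"),
    (5, "YYYYYYYL", "", "", ""),
], ["sequence", "registerNumber", "first_name", "middle_name", "surname"])

cols = ['first_name', 'middle_name', 'surname']

# Same expression as above: emit each column name only for non-empty values
df = (df.withColumn("error_column",
                    F.concat_ws('-',
                                *[F.when(F.length(F.col(x)) > 0, F.lit(x))
                                  for x in cols])))

df.show(truncate=False)
# error_column is "first_name-middle_name" for the row with two names filled,
# "surname" for the row with only a surname, and "" for the all-blank row.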

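Answer 2

I understand your requirement: you want to transform the input dataframe into the output format where errorColumn holds the names of the columns that are not empty, joined with - whenever two or more fields have values, and you tried concat but the performance was poor.

You can try the when and concat_ws functions from pyspark.sql.functions. Here is an example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, concat_ws

# Create the Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Your input dataframe
input_df = spark.createDataFrame([
    (1, "XXXXXXXX", None, None, None),
    (1, "XXXXXXXX", None, None, None),
    (2, "YYYYYYYY", "CAR弟LINE", "ELIZABET弟", None),
    (3, "YYYYYYYZ", None, None, "GAL弟"),
    (4, "YYYYYYYM", "弟AROLINE", None, None),
    (5, "YYYYYYYL", None, None, None)
], ["sequence", "registerNumber", "first_name", "middle_name", "surname"])

# Build errorColumn with when and concat_ws: each when() yields the column
# name when that column is not null, and concat_ws skips the nulls.
output_df = input_df.withColumn("errorColumn",
    concat_ws("-",
        when(input_df["first_name"].isNotNull(), "first_name"),
        when(input_df["middle_name"].isNotNull(), "middle_name"),
        when(input_df["surname"].isNotNull(), "surname")
    )
)

# Show the result
output_df.show()

This creates a new dataframe output_df in which errorColumn contains the desired column names separated by -, including a name only when the corresponding column is not null. Hopefully this helps improve the performance.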