Create new Column based on the data of existing columns

Question

I have an input dataframe with these columns as below:

+--------+--------------+----------+-----------+-------+
|sequence|registerNumber|first_name|middle_name|surname|
+--------+--------------+----------+-----------+-------+
|       1|      XXXXXXXX|          |           |       |
|       1|      XXXXXXXX|          |           |       |
|       2|      YYYYYYYY| CAR弟LINE| ELIZABET弟|       |
|       3|      YYYYYYYZ|          |           |  GAL弟|
|       4|      YYYYYYYM| 弟AROLINE|           |       |
|       5|      YYYYYYYL|          |           |       |
+--------+--------------+----------+-----------+-------+

I want to have an output dataframe like this:

+--------+--------------+----------+-----------+-------+----------------------+
|sequence|registerNumber|first_name|middle_name|surname|errorColumn           |
+--------+--------------+----------+-----------+-------+----------------------+
|       1|      XXXXXXXX|          |           |       |                      |
|       1|      XXXXXXXX|          |           |       |                      |
|       2|      YYYYYYYY| CAR弟LINE| ELIZABET弟|       |first_name-middle_name|
|       3|      YYYYYYYZ|          |           |  GAL弟|surname               |
|       4|      YYYYYYYM| 弟AROLINE|           |       |first_name            |
|       5|      YYYYYYYL|          |           |       |                      |
+--------+--------------+----------+-----------+-------+----------------------+

The errorColumn should contain the names of the columns (first_name, middle_name, surname) that aren't empty, joined with - as the separator whenever two or more fields have values.

I am trying to do this for a list of columns and tried using concat, but the performance is poor.

Answer 1

Score: 2


Using concat_ws and when, emitting a column name only when its character length is greater than 0 (i.e., the value is not an empty string):

from pyspark.sql import functions as F

cols = ['first_name', 'middle_name', 'surname']

# For each column, when() yields the column name if the value is non-empty
# and null otherwise; concat_ws('-') joins the names and skips the nulls.
df = (df.withColumn("error_column",
                    F.concat_ws('-',
                                *[F.when(F.length(F.col(x)) > 0, F.lit(x))
                                  for x in cols])))
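
For reference, a minimal self-contained sketch of how this snippet behaves, using made-up placeholder rows with the question's column layout (the names, the app name, and the assumption that the blank cells are empty strings are hypothetical; F.length returns null for null input, so the when condition also fails for nulls and concat_ws drops the entry either way):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("error-column-demo").getOrCreate()

# Placeholder rows (hypothetical values), blanks represented as empty strings
df = spark.createDataFrame([
    (2, "YYYYYYYY", "ANNA", "MARIE", ""),
    (3, "YYYYYYYZ", "", "", "SMITH"),
    (5, "YYYYYYYL", "", "", ""),
], ["sequence", "registerNumber", "first_name", "middle_name", "surname"])

cols = ['first_name', 'middle_name', 'surname']

# Same expression as above: emit each column name only for non-empty values
df = (df.withColumn("error_column",
                    F.concat_ws('-',
                                *[F.when(F.length(F.col(x)) > 0, F.lit(x))
                                  for x in cols])))

df.show(truncate=False)
# error_column is "first_name-middle_name" for the row with two names filled,
# "surname" for the row with only a surname, and "" for the all-blank row.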

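Answer 2

I understand your requirement: you want to transform the input dataframe into the output format where errorColumn holds the names of the columns that are not empty, joined with - whenever two or more fields have values, and you tried concat but the performance was poor.

You can try the when and concat_ws functions from pyspark.sql.functions. Here is an example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, concat_ws

# Create the Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Your input dataframe
input_df = spark.createDataFrame([
    (1, "XXXXXXXX", None, None, None),
    (1, "XXXXXXXX", None, None, None),
    (2, "YYYYYYYY", "CAR弟LINE", "ELIZABET弟", None),
    (3, "YYYYYYYZ", None, None, "GAL弟"),
    (4, "YYYYYYYM", "弟AROLINE", None, None),
    (5, "YYYYYYYL", None, None, None)
], ["sequence", "registerNumber", "first_name", "middle_name", "surname"])

# Build errorColumn with when and concat_ws: each when() yields the column
# name when that column is not null, and concat_ws skips the nulls.
output_df = input_df.withColumn("errorColumn",
    concat_ws("-",
        when(input_df["first_name"].isNotNull(), "first_name"),
        when(input_df["middle_name"].isNotNull(), "middle_name"),
        when(input_df["surname"].isNotNull(), "surname")
    )
)

# Show the result
output_df.show()

This creates a new dataframe output_df in which errorColumn contains the desired column names separated by -, including a name only when the corresponding column is not null. Hopefully this helps improve the performance.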