Create new Column based on the data of existing columns

Question

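I understand what you need. You want to transform the input DataFrame into the output format, where errorColumn holds the names of the columns that are not empty, joined with - as a separator when two or more fields have a value. You tried concat, but the performance was poor.

You can try the when and concat_ws functions from pyspark.sql.functions to achieve this. Here is an example: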

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import when, concat_ws

    # Create the Spark session
    spark = SparkSession.builder.appName("example").getOrCreate()

    # Your input DataFrame
    input_df = spark.createDataFrame([
        (1, "XXXXXXXX", None, None, None),
        (1, "XXXXXXXX", None, None, None),
        (2, "YYYYYYYY", "CAR弟LINE", "ELIZABET弟", None),
        (3, "YYYYYYYZ", None, None, "GAL弟"),
        (4, "YYYYYYYM", "弟AROLINE", None, None),
        (5, "YYYYYYYL", None, None, None)
    ], ["sequence", "registerNumber", "first_name", "middle_name", "surname"])

    # Build errorColumn with when and concat_ws:
    # each when() yields the column name if the value is not null, otherwise null
    output_df = input_df.withColumn(
        "errorColumn",
        concat_ws(
            "-",
            when(input_df["first_name"].isNotNull(), "first_name"),
            when(input_df["middle_name"].isNotNull(), "middle_name"),
            when(input_df["surname"].isNotNull(), "surname")
        )
    )

    # Show the result
    output_df.show()

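This code creates a new DataFrame, output_df, in which errorColumn holds the column names you want, separated by -, with each name included only if the corresponding column is not null. Hopefully this helps you improve performance.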
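The reason this pattern works is that when() without an otherwise() branch yields null when its condition is false, and concat_ws() skips null arguments instead of propagating them. The plain-Python sketch below models that behaviour without Spark, so it can be run anywhere. The functions here are illustrative stand-ins rather than the PySpark API, `error_column` is a hypothetical helper, `None` stands in for SQL null, and the row values are placeholders.

```python
from typing import Mapping, Optional, Sequence


def when(condition: bool, value: str) -> Optional[str]:
    """Model of when(cond, value) with no otherwise(): None stands in for SQL null."""
    return value if condition else None


def concat_ws(sep: str, *values: Optional[str]) -> str:
    """Model of concat_ws: null arguments are skipped, the rest are joined with sep."""
    return sep.join(v for v in values if v is not None)


def error_column(row: Mapping[str, Optional[str]], cols: Sequence[str]) -> str:
    """Hypothetical helper: join the names of the columns whose value is not null."""
    return concat_ws("-", *[when(row.get(c) is not None, c) for c in cols])


cols = ["first_name", "middle_name", "surname"]
rows = [
    {"first_name": None, "middle_name": None, "surname": None},
    {"first_name": "A", "middle_name": "B", "surname": None},
    {"first_name": None, "middle_name": None, "surname": "C"},
    {"first_name": "A", "middle_name": None, "surname": None},
]

results = [error_column(row, cols) for row in rows]
print(results)  # ['', 'first_name-middle_name', 'surname', 'first_name']
```

A row with no values produces an empty string rather than null, and the separator appears only when two or more names survive the filter.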

Original question in English:

I have an input DataFrame with the following columns:

    +--------+--------------+----------+-----------+-------+
    |sequence|registerNumber|first_name|middle_name|surname|
    +--------+--------------+----------+-----------+-------+
    |       1|      XXXXXXXX|          |           |       |
    |       1|      XXXXXXXX|          |           |       |
    |       2|      YYYYYYYY|   CARLINE|  ELIZABET弟|       |
    |       3|      YYYYYYYZ|          |           |   GAL弟|
    |       4|      YYYYYYYM|   AROLINE|           |       |
    |       5|      YYYYYYYL|          |           |       |
    +--------+--------------+----------+-----------+-------+

I want an output DataFrame like this:

    +--------+--------------+----------+-----------+-------+----------------------+
    |sequence|registerNumber|first_name|middle_name|surname|errorColumn           |
    +--------+--------------+----------+-----------+-------+----------------------+
    |       1|      XXXXXXXX|          |           |       |                      |
    |       1|      XXXXXXXX|          |           |       |                      |
    |       2|      YYYYYYYY|   CARLINE|  ELIZABET弟|       |first_name-middle_name|
    |       3|      YYYYYYYZ|          |           |   GAL弟|surname               |
    |       4|      YYYYYYYM|   AROLINE|           |       |first_name            |
    |       5|      YYYYYYYL|          |           |       |                      |
    +--------+--------------+----------+-----------+-------+----------------------+

The errorColumn should contain the names of the columns (first_name, middle_name, surname) that aren't empty, joined with - as a separator whenever two or more fields have a value.

I need to do this for a list of columns. I tried using concat, but the performance is poor.

Answer 1

Score: 2

Use concat_ws with when, and include a column name only when the value's character length is greater than 0 (that is, not an empty string).

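    from pyspark.sql import functions as F

    # Columns to check for non-empty values
    cols = ['first_name', 'middle_name', 'surname']

    # when() without otherwise() yields null when the condition fails,
    # and concat_ws() skips nulls, so only the matching names are joined
    df = df.withColumn(
        "error_column",
        F.concat_ws('-', *[F.when(F.length(F.col(x)) > 0, F.lit(x)) for x in cols]),
    )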
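A length check and a null check can disagree when a field holds an empty string: a not-null test still flags it, while a length test does not. A null value is skipped by both, because the length of null is null, the comparison with 0 is then null as well, and when() treats a null condition as not matched. The plain-Python sketch below models the two predicates to show the difference. The helper names are hypothetical, `None` stands in for SQL null, and the row is made-up sample data.

```python
from typing import Callable, Mapping, Optional, Sequence


def is_not_null(value: Optional[str]) -> bool:
    """Predicate modelled on a not-null check: an empty string still counts."""
    return value is not None


def has_length(value: Optional[str]) -> bool:
    """Predicate modelled on length(col) > 0: null and empty string are both rejected."""
    return value is not None and len(value) > 0


def flagged(
    row: Mapping[str, Optional[str]],
    cols: Sequence[str],
    predicate: Callable[[Optional[str]], bool],
) -> str:
    """Hypothetical helper: join the names of the columns accepted by the predicate."""
    return "-".join(c for c in cols if predicate(row.get(c)))


cols = ["first_name", "middle_name", "surname"]
# middle_name is an empty string, surname is null
row = {"first_name": "AROLINE", "middle_name": "", "surname": None}

by_not_null = flagged(row, cols, is_not_null)
by_length = flagged(row, cols, has_length)
print(by_not_null)  # first_name-middle_name
print(by_length)    # first_name
```

Which predicate is appropriate depends on whether the blank cells in the input are nulls or empty strings. Note that a whitespace-only value passes both checks in this model.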

huangapple
  • Published on April 13, 2023, 22:38:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76006737.html