How to rename a column inside a nested column in PySpark

Question

I have a column product that contains a nested column called {Color}. I want to remove the {} from the nested column's name.

I don't want to flatten the column and rename it. I want to rename the nested column directly, or drop it.

    |-- product: struct (nullable = true)
    |    |-- {Color}: string (nullable = true)

I tried dropping it, but that doesn't work. I don't want to build a new struct by hand, because the struct has many more nested columns.


Answer 1

Score: 2

Use .withField to add the field under its new name without flattening the struct, then use .dropFields to remove the old nested field from the struct. Both Column methods require Spark 3.1 or later.

Example:

    from pyspark.sql.functions import col

    # sample JSON with a nested field named {Color}
    # (assumes an existing SparkSession named spark)
    json = '{"product":{"{Color}":"a"}}'
    df = spark.read.json(spark.sparkContext.parallelize([json]))

    # copy the data of `{Color}` into a new field Color with .withField,
    # then remove the old `{Color}` field with .dropFields
    df1 = (
        df.withColumn("product", df["product"].withField("Color", col("product.`{Color}`")))
          .withColumn("product", col("product").dropFields("`{Color}`"))
    )
    df1.printSchema()
    df1.show(10, False)

    # root
    #  |-- product: struct (nullable = true)
    #  |    |-- Color: string (nullable = true)
    #
    # +-------+
    # |product|
    # +-------+
    # |{a}    |
    # +-------+
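
Since the question mentions many nested columns, the same withField/dropFields pair can be applied in a loop over the struct's schema. The following is only a sketch under assumptions: the helper names strip_braces, quote and rename_braced_fields are hypothetical (they are not part of PySpark), and it assumes Spark 3.1 or later and that only top-level fields of the struct need renaming.

```python
import re


def strip_braces(name: str) -> str:
    """Return the field name without one pair of surrounding curly braces."""
    match = re.fullmatch(r"\{(.+)\}", name)
    return match.group(1) if match else name


def quote(name: str) -> str:
    """Backtick-quote a field name so characters like { and } survive parsing."""
    return "`" + name.replace("`", "``") + "`"


def rename_braced_fields(df, struct_col: str):
    """Rename every {x} field of struct_col to x, leaving other fields alone."""
    # Imported lazily so the pure-Python helpers above work without Spark.
    from pyspark.sql.functions import col

    result = df
    for field in df.schema[struct_col].dataType.fields:
        new_name = strip_braces(field.name)
        if new_name == field.name:
            continue  # no braces around this field name, nothing to do
        source = col(f"{quote(struct_col)}.{quote(field.name)}")
        result = result.withColumn(
            struct_col,
            col(struct_col)
            .withField(quote(new_name), source)  # add the field under the new name
            .dropFields(quote(field.name)),      # remove the old {x} field
        )
    return result
```

With the df from the example above, rename_braced_fields(df, "product").printSchema() would be one way to check the result. Two caveats: withField appends a new field at the end of the struct, so field order can change, and if a field with the new name already exists it is replaced.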

Answer 2

Score: 0

You can use withColumn to rename the nested column. I have a DataFrame with the same schema as yours:

    df.printSchema()

Now use withColumn to rebuild the struct, aliasing the nested {Color} field as Color:

    from pyspark.sql.functions import col, struct

    df.withColumn("product", struct(col("product.{Color}").alias("Color"))).printSchema()
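
Note that struct(...) builds a new struct containing only the fields passed to it, so with more nested columns every field would have to be listed. One way to avoid typing them out is to read the field names from the schema. This is a rough sketch, not the answerer's code: plan_aliases and rebuild_struct are hypothetical helper names, and the renames mapping is an assumed input.

```python
from typing import Dict, List, Tuple


def plan_aliases(field_names: List[str], renames: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair every existing field with its output name, preserving field order."""
    return [(name, renames.get(name, name)) for name in field_names]


def rebuild_struct(df, struct_col: str, renames: Dict[str, str]):
    """Rebuild struct_col, aliasing the fields listed in renames to new names."""
    # Imported lazily so plan_aliases can be used without a Spark installation.
    from pyspark.sql.functions import col, struct

    field_names = df.schema[struct_col].dataType.fieldNames()
    fields = [
        # Backticks keep names such as {Color} from being misparsed.
        col(f"{struct_col}.`{old}`").alias(new)
        for old, new in plan_aliases(field_names, renames)
    ]
    return df.withColumn(struct_col, struct(*fields))
```

For the schema in the question, a call could look like rebuild_struct(df, "product", {"{Color}": "Color"}).printSchema(). Unlike the withField approach, this keeps the original field order and does not need Spark 3.1, but as far as I know a row whose product is null comes back as a struct of null fields rather than a null struct.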
