How to rename the columns inside nested column in pyspark

Question

I have a struct column product that contains a nested field named {Color}. I want to remove the {} from that field name.

I don't want to flatten the column and rename it. I want to rename (or drop) the nested field directly.

|-- product: struct (nullable = true) 
| |-- {Color}: string (nullable = true) 

I have tried dropping it, but that doesn't work. I don't want to build a new struct by hand, because product has many more nested fields.

Answer 1

Score: 2

Use .withField to add the field under its new name without flattening the struct.

Then use .dropFields to remove the old nested field from the struct. Both methods are available on Column from Spark 3.1 onward.

Example:

# sample json (assumes an active SparkSession named `spark`)
from pyspark.sql.functions import col

json = '{"product":{"{Color}":"a"}}'
df = spark.read.json(spark.sparkContext.parallelize([json]))

# create the Color field with `.withField`, copying the data from `{Color}`
# then use `.dropFields` to drop the old `{Color}` field from the struct
df1 = df.withColumn("product", df["product"].withField("Color", col("product.`{Color}`"))) \
    .withColumn("product", col("product").dropFields("`{Color}`"))
df1.printSchema()
df1.show(10, False)

#root
# |-- product: struct (nullable = true)
# |    |-- Color: string (nullable = true)
#
#+-------+
#|product|
#+-------+
#|{a}    |
#+-------+
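
When the struct has many brace-wrapped fields, the same withField / dropFields pair can be applied in a loop driven by the schema. The following is only a sketch under assumptions: the helper names strip_braces, rename_plan, and rename_braced_fields are hypothetical (not part of PySpark), Spark 3.1+ is assumed, and field names are assumed not to contain backticks.

```python
def strip_braces(name):
    """Return the field name without leading/trailing curly braces."""
    return name.strip("{}")


def rename_plan(field_names):
    """List (old, new) pairs for every field whose name would change."""
    return [(old, strip_braces(old)) for old in field_names
            if strip_braces(old) != old]


def rename_braced_fields(df, struct_col):
    """Rename brace-wrapped fields of one struct column (hypothetical helper)."""
    # Imported lazily so the pure helpers above work without PySpark installed.
    from pyspark.sql.functions import col

    names = df.schema[struct_col].dataType.fieldNames()
    updated = col(struct_col)
    for old, new in rename_plan(names):
        # Copy the value under the new name, then drop the old field.
        updated = (updated
                   .withField(f"`{new}`", col(f"{struct_col}.`{old}`"))
                   .dropFields(f"`{old}`"))
    return df.withColumn(struct_col, updated)
```

With the sample DataFrame above, rename_braced_fields(df, "product") should produce the same result as df1. Fields that are already clean are left untouched, but each renamed field is re-added by withField, so it moves to the end of the struct.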

Answer 2

Score: 0

  • You can use withColumn to rename the nested column. I have a DataFrame with the same schema as yours:
df.printSchema()

  • Now, using withColumn together with struct, you can change the name of the nested column {Color}. Note that struct(...) rebuilds product from only the fields listed, so any other nested fields have to be added to the call as well:
from pyspark.sql.functions import col, struct

df.withColumn("product", struct(col("product.{Color}").alias("Color"))).printSchema()
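
If the struct has many fields, the list passed to struct(...) does not have to be typed by hand; it can be generated from the schema. The sketch below assumes that approach: struct_field_exprs and rebuild_struct are hypothetical helper names, and field names are assumed not to contain backticks.

```python
def struct_field_exprs(struct_col, field_names):
    """Pair each nested field's backtick-quoted path with its cleaned alias."""
    return [(f"{struct_col}.`{name}`", name.strip("{}"))
            for name in field_names]


def rebuild_struct(df, struct_col):
    """Rebuild a struct column with brace-free field names (hypothetical helper)."""
    # Imported lazily so struct_field_exprs can be used without PySpark.
    from pyspark.sql.functions import col, struct

    names = df.schema[struct_col].dataType.fieldNames()
    fields = [col(path).alias(alias)
              for path, alias in struct_field_exprs(struct_col, names)]
    return df.withColumn(struct_col, struct(*fields))
```

Calling rebuild_struct(df, "product").printSchema() should show the Color field in its original position. One trade-off to keep in mind: struct(...) builds a new struct for every row, so a row whose product was null ends up with a struct of null fields rather than a null struct.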

huangapple
  • Posted on April 19, 2023 at 14:55:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76051523.html