How to rename a column inside a nested column in PySpark
Question
I have a column product that contains a nested column called {Color}. I want to remove the {} from the nested column's name.
I don't want to flatten the column and then rename it. I want to rename or drop the nested column directly.
|-- product: struct (nullable = true)
| |-- {Color}: string (nullable = true)
I have tried dropping it, but that doesn't work. I don't want to create a new struct by hand, because I have many more nested columns.
Answer 1
Score: 2
Try .withField to update the field name without flattening. Then use .dropFields to drop the nested column from the struct.
Example:
from pyspark.sql.functions import col

# Sample JSON (assumes an active SparkSession `spark` and SparkContext `sc`)
json = '{"product":{"{Color}":"a"}}'
df = spark.read.json(sc.parallelize([json]))

# Use .withField to create a Color field holding the data from `{Color}`,
# then use .dropFields to drop the original `{Color}` field from the struct
df1 = df.withColumn("product", df["product"].withField("Color", col("product.`{Color}`"))) \
    .withColumn("product", col("product").dropFields("`{Color}`"))
df1.printSchema()
df1.show(10, False)
#root
# |-- product: struct (nullable = true)
# | |-- Color: string (nullable = true)
#
#+-------+
#|product|
#+-------+
#|{a} |
#+-------+
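Since the question mentions many nested columns, the same withField/dropFields pattern can be applied in a loop over the struct's schema. The following is a minimal sketch, not a tested production helper: `strip_braces` and `rename_braced_fields` are hypothetical names introduced here, and the sketch assumes Spark 3.1 or later (where `Column.withField` and `Column.dropFields` are available) and field names that contain no backticks.

```python
import re


def strip_braces(name):
    """Return the field name with any '{' or '}' characters removed."""
    return re.sub(r"[{}]", "", name)


def rename_braced_fields(df, struct_col):
    """Rename every field of `struct_col` whose name contains braces.

    Hypothetical helper, not part of PySpark. Assumes Spark 3.1+.
    """
    # Imported here so strip_braces can be used without PySpark installed.
    from pyspark.sql.functions import col

    updated = col(struct_col)
    for field in df.schema[struct_col].dataType.fields:
        new_name = strip_braces(field.name)
        if new_name == field.name:
            continue  # nothing to rename for this field
        old_ref = f"`{field.name}`"  # backticks quote the special characters
        # Copy the data under the clean name, then drop the original field.
        updated = updated.withField(
            new_name, col(f"{struct_col}.{old_ref}")
        ).dropFields(old_ref)
    return df.withColumn(struct_col, updated)
```

With the sample DataFrame above, `rename_braced_fields(df, "product")` should give the same schema as `df1`. Two caveats: renamed fields are appended at the end of the struct, so field order can change, and if a cleaned name collides with an existing field, `withField` replaces that field.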
Answer 2
Score: 0
- You can use withColumn to rename the nested column. I have a DataFrame with the same schema as yours:
df.printSchema()
- Now, using withColumn as shown below, you can change the name of the nested column:
from pyspark.sql.functions import col, struct
df.withColumn("product", struct(col("product.{Color}").alias("Color"))).printSchema()
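Note that the struct(...) call above keeps only the fields that are listed explicitly, so any other nested fields in product would be lost. A possible way to extend it to many fields is to generate the aliases from the schema, as in the sketch below. `strip_braces`, `alias_pairs`, and `rebuild_struct` are hypothetical names introduced here, and the sketch assumes field names without backticks.

```python
def strip_braces(name):
    """Return the field name with any '{' or '}' characters removed."""
    return name.replace("{", "").replace("}", "")


def alias_pairs(field_names, struct_col):
    """Pair each backtick-quoted source reference with its cleaned alias."""
    return [
        (f"{struct_col}.`{name}`", strip_braces(name))
        for name in field_names
    ]


def rebuild_struct(df, struct_col):
    """Rebuild `struct_col` with cleaned field names, keeping field order.

    Hypothetical helper, not part of PySpark.
    """
    # Imported here so the pure-Python helpers work without PySpark installed.
    from pyspark.sql.functions import col, struct

    names = [f.name for f in df.schema[struct_col].dataType.fields]
    renamed = [col(src).alias(dst) for src, dst in alias_pairs(names, struct_col)]
    return df.withColumn(struct_col, struct(*renamed))
```

Unlike withField, this approach does not need Spark 3.1 and preserves the original field order. One thing to check on your data: a row where product itself is null comes back as a non-null struct whose fields are null.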