更新具有空值的嵌套结构。

huangapple go评论65阅读模式
英文:

update nested struct with null values

问题

以下是您要翻译的内容:

"I have a dataframe with a column which is nested StructType. The StructType is deeply nested and may comprise other Structs. Now I want to update this column at the lowest level.
I tried withField but it doesn't work if any of the top level struct is null. I will appreciate any help with this.

The example schema is:

val schema = new StructType()
      .add("key", StringType)
      .add(
        "cells",
        ArrayType(
          new StructType()
            .add("family", StringType)
            .add("qualifier", StringType)
            .add("timestamp", LongType)
            .add("nestStruct", new StructType()
                .add("id1", LongType)
                .add("id2", StringType)
                .add("id3", new StructType()
                    .add("id31", LongType)
                    .add("id32", StringType))
        )
      )

val data = Seq(
      Row(
        "1235321863",
        Array(
          Row("a", "b", 1L,  null)
        )
      )
    )


val  df_test = spark
      .createDataFrame(spark.sparkContext.parallelize(data), schema) 

val result = df_test.withColumn(
  "cell1",
  transform($"cells", cell => {
      cell.withField("nestStruct.id3.id31", lit(40)) /*This line doesn't do anything is nestStruct is null. */
  }))
result.show(false)
result.printSchema 
result.explain() /*The physical plan shows that if a field is null it will just return null*/

注意:我已经删除了代码部分,只提供了翻译好的内容。

英文:

I have a dataframe with a column which is nested StructType. The StructType is deeply nested and may comprise other Structs. Now I want to update this column at the lowest level.
I tried withField but it doesn't work if any of the top level struct is null. I will appreciate any help with this.

The example schema is:

val schema = new StructType()
      .add("key", StringType)
      .add(
        "cells",
        ArrayType(
          new StructType()
            .add("family", StringType)
            .add("qualifier", StringType)
            .add("timestamp", LongType)
            .add("nestStruct", new StructType()
                .add("id1", LongType)
                .add("id2", StringType)
.               .add("id3", new StructType()
                   .add("id31", LongType)
                   .add("id32", StringType))
        )
      )

val data = Seq(
      Row(
        "1235321863",
        Array(
          Row("a", "b", 1L,  null)
        )
      )
    )

  
   val  df_test = spark
      .createDataFrame(spark.sparkContext.parallelize(data), schema) 

val result = df_test.withColumn(
  "cell1",
  transform($"cells", cell => {
      cell.withField("nestStruct.id3.id31", lit(40)) /*This line doesn't do anything is nestStruct is null. */
  }))
result.show(false)
result.printSchema 
result.explain() /*The physical plan shows that if a field is null it will just return null*/

答案1

得分: 1

你可以使用此问题建议的解决方案:https://stackoverflow.com/questions/48777993/how-do-i-add-a-column-to-a-nested-struct-in-a-pyspark-dataframe

或者你可以尝试以下方法:
你可以将当前的数据框写入一个JSON文件,将JSON文件读取为字符串,然后尝试使用正则表达式将你想要添加的字段添加到JSON字符串中,然后将JSON字符串写入一个新文件,最后将新文件读取为数据框。

例如,我正在使用上面提供的示例:

import json, re

with open('./pyspark_sandbox_sample.json') as input_file:
    string_data = str(json.load(input_file))
input_file.close()

string_data = re.sub(r"'id32': '(.*?)'", r"'id32': '', 'id33': 40", string_data)

with open('./pyspark_sandbox_sample.json', 'w') as output_file:
    json.dump(eval(string_data), output_file)
output_file.close()
英文:

You can use the solution suggested for this question: https://stackoverflow.com/questions/48777993/how-do-i-add-a-column-to-a-nested-struct-in-a-pyspark-dataframe

Or you can try the following:
You can write your current dataframe to a json file, read the json file to a string, and try writing a regular expression to add the field you want to the json string, then write the json string to a new file, and read the new file to a dataframe.

For example, I'm using the example provided above:

import json, re

with open('./pyspark_sandbox_sample.json') as input_file:
    string_data = str(json.load(input_file))
input_file.close()

string_data = re.sub(r"'id32': '(.*?)'", r"'id32': '', 'id33': 40", string_data)

with open('./pyspark_sandbox_sample.json', 'w') as output_file:
    json.dump(eval(string_data), output_file)
output_file.close()

huangapple
  • 本文由 发表于 2023年2月18日 07:11:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/75490013.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定