Update nested struct with null values

Question
"I have a dataframe with a column which is nested StructType. The StructType is deeply nested and may comprise other Structs. Now I want to update this column at the lowest level.
I tried withField but it doesn't work if any of the top level struct is null. I will appreciate any help with this.
The example schema is:
val schema = new StructType()
  .add("key", StringType)
  .add(
    "cells",
    ArrayType(
      new StructType()
        .add("family", StringType)
        .add("qualifier", StringType)
        .add("timestamp", LongType)
        .add("nestStruct", new StructType()
          .add("id1", LongType)
          .add("id2", StringType)
          .add("id3", new StructType()
            .add("id31", LongType)
            .add("id32", StringType))
        )
    )
  )
val data = Seq(
  Row(
    "1235321863",
    Array(
      Row("a", "b", 1L, null)
    )
  )
)

val df_test = spark
  .createDataFrame(spark.sparkContext.parallelize(data), schema)
val result = df_test.withColumn(
  "cell1",
  transform($"cells", cell => {
    cell.withField("nestStruct.id3.id31", lit(40)) /* This line doesn't do anything if nestStruct is null. */
  }))

result.show(false)
result.printSchema
result.explain() /* The physical plan shows that if a field is null it will just return null. */
Answer 1
Score: 1
You can use the solution suggested for this question: https://stackoverflow.com/questions/48777993/how-do-i-add-a-column-to-a-nested-struct-in-a-pyspark-dataframe
Or you can try the following:
Write your current dataframe to a JSON file, read the JSON file into a string, use a regular expression to add the field you want to the JSON string, write the modified JSON string to a new file, and read the new file back into a dataframe.
For example, using the sample provided above:
import json, re

with open('./pyspark_sandbox_sample.json') as input_file:
    string_data = str(json.load(input_file))

string_data = re.sub(r"'id32': '(.*?)'", r"'id32': '', 'id33': 40", string_data)

with open('./pyspark_sandbox_sample.json', 'w') as output_file:
    json.dump(eval(string_data), output_file)
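To see what the substitution does without touching any files, the same regex can be applied to an in-memory string. This is a minimal sketch with a hypothetical sample fragment; the single-quoted style comes from calling str() on the parsed JSON, as in the snippet above, and ast.literal_eval is used as a safer alternative to eval for parsing the dict-literal string back:

```python
import ast
import re

# Hypothetical fragment in the same single-quoted form produced by str(json.load(...))
string_data = "{'nestStruct': {'id3': {'id31': 1, 'id32': 'abc'}}}"

# Same pattern as above: rewrite id32 and append a new id33 field after it
string_data = re.sub(r"'id32': '(.*?)'", r"'id32': '', 'id33': 40", string_data)

# ast.literal_eval only accepts Python literals, so it cannot execute arbitrary code
result = ast.literal_eval(string_data)
print(result['nestStruct']['id3'])  # {'id31': 1, 'id32': '', 'id33': 40}
```

Note that the replacement string above discards the captured id32 value; use a backreference such as r"'id32': '\1', 'id33': 40" if the original value should be preserved.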