Update nested struct with null values

Question
"I have a dataframe with a column which is nested StructType. The StructType is deeply nested and may comprise other Structs. Now I want to update this column at the lowest level.
I tried withField but it doesn't work if any of the top level struct is null. I will appreciate any help with this.
The example schema is:
val schema = new StructType()
  .add("key", StringType)
  .add(
    "cells",
    ArrayType(
      new StructType()
        .add("family", StringType)
        .add("qualifier", StringType)
        .add("timestamp", LongType)
        .add("nestStruct", new StructType()
          .add("id1", LongType)
          .add("id2", StringType)
          .add("id3", new StructType()
            .add("id31", LongType)
            .add("id32", StringType))
        )
    )
  )
val data = Seq(
  Row(
    "1235321863",
    Array(
      Row("a", "b", 1L, null)
    )
  )
)

val df_test = spark
  .createDataFrame(spark.sparkContext.parallelize(data), schema)
val result = df_test.withColumn(
  "cell1",
  transform($"cells", cell => {
    cell.withField("nestStruct.id3.id31", lit(40)) /* This line doesn't do anything if nestStruct is null. */
  }))

result.show(false)
result.printSchema
result.explain() /* The physical plan shows that if a field is null it will just return null. */
Answer 1
Score: 1
You can use the solution suggested for this question: https://stackoverflow.com/questions/48777993/how-do-i-add-a-column-to-a-nested-struct-in-a-pyspark-dataframe
Or you can try the following:
Write your current dataframe to a JSON file, read the JSON file into a string, use a regular expression to add the field you want to the JSON string, write the modified JSON string to a new file, and read the new file back into a dataframe.
For example, using the sample provided above:
import json, re

with open('./pyspark_sandbox_sample.json') as input_file:
    string_data = str(json.load(input_file))

string_data = re.sub(r"'id32': '(.*?)'", r"'id32': '', 'id33': 40", string_data)

with open('./pyspark_sandbox_sample.json', 'w') as output_file:
    json.dump(eval(string_data), output_file)
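To see what the substitution does without touching any files, the same regex can be applied to an in-memory string. This is a minimal sketch with a hypothetical sample fragment; the single-quoted style comes from calling str() on the parsed JSON, as in the snippet above, and ast.literal_eval is used as a safer alternative to eval for parsing the dict-literal string back:

```python
import ast
import re

# Hypothetical fragment in the same single-quoted form produced by str(json.load(...))
string_data = "{'nestStruct': {'id3': {'id31': 1, 'id32': 'abc'}}}"

# Same pattern as above: rewrite id32 and append a new id33 field after it
string_data = re.sub(r"'id32': '(.*?)'", r"'id32': '', 'id33': 40", string_data)

# ast.literal_eval only accepts Python literals, so it cannot execute arbitrary code
result = ast.literal_eval(string_data)
print(result['nestStruct']['id3'])  # {'id31': 1, 'id32': '', 'id33': 40}
```

Note that the replacement string above discards the captured id32 value; use a backreference such as r"'id32': '\1', 'id33': 40" if the original value should be preserved.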