处理嵌套的 JSON 结构

huangapple go评论78阅读模式
英文:

handling nested Json structure

问题

假设我们有以下的JSON结构:

{
	"positions": {
		"node": "abc"
	},
	"submissions": {
		"submissionOffsets": [
			{
				"attributeName": "sample1",
				"attributeValue": 1224
			},
			{
				"attributeName": "sample2",
				"attributeValue": 1224
			},
			{
				"attributeName": "sample3",
				"attributeValue": 1224
			},
			{
				"attributeName": "sample4",
				"attributeValue": 1224
			}
		]
	}
}

我们想要读取"submissionOffsets",并根据属性名(例如"sample1")提取attributeName和attributeValue。期望的结构如下所示:
匹配的情况下

{
	"positions": {
		"node": "abc"
	},
	"submissions": {
		"submissionOffsets": [
			{
				"attributeName": "sample1",
				"attributeValue": 1224
			},
			{
				"attributeName": "sample2",
				"attributeValue": 1224
			},
			{
				"attributeName": "sample3",
				"attributeValue": 1224
			},
			{
				"attributeName": "sample4",
				"attributeValue": 1224
			}
		]
	},
	"attributeName": "sample1",
	"attributeValue": 1224
}

不匹配的情况下

{
	"positions": {
		"node": "abc"
	},
	"submissions": {
		"submissionOffsets": [
			{
				"attributeName": "sample1",
				"attributeValue": 1224
			},
			{
				"attributeName": "sample2",
				"attributeValue": 1224
			},
			{
				"attributeName": "sample3",
				"attributeValue": 1224
			},
			{
				"attributeName": "sample4",
				"attributeValue": 1224
			}
		]
	},
	"attributeName": null,
	"attributeValue": 0.00
}

这需要在数据框中完成。

我尝试使用数据框(dataframes),我展开了submissions.submissionOffsets,然后检查属性名和值,但这只给出了一列,我需要将其与原始数据框连接起来。

英文:

suppose we have following json structure :

{
	"positions": {
		"node": "abc"
	}
	"submissions" :{
		"submissionOffsets":[
		{
			"attributeName": "sample1",
			"attributeValue": 1224
		},
		{
			"attributeName": "sample2",
			"attributeValue": 1224
		},
		{
			"attributeName": "sample3",
			"attributeValue": 1224
		},
		{
			"attributeName": "sample4",
			"attributeValue": 1224
		}
		
	}
}

and we want to read "submissionOffsets" and extract attributeName and attributeValue based on attribute name for example "sample1" and expected structure should be in case of Match

{
	"positions": {
		"node": "abc"
	}
	"submissions" :{
		"submissionOffsets":[
		{
			"attributeName": "sample1",
			"attributeValue": 1224
		},
		{
			"attributeName": "sample2",
			"attributeValue": 1224
		},
		{
			"attributeName": "sample3",
			"attributeValue": 1224
		},
		{
			"attributeName": "sample4",
			"attributeValue": 1224
		}
		
	},
"attributeName": "sample1",
"attributeValue": 1224
}


**Incase of No Match**

{
	"positions": {
		"node": "abc"
	}
	"submissions" :{
		"submissionOffsets":[
		{
			"attributeName": "sample1",
			"attributeValue": 1224
		},
		{
			"attributeName": "sample2",
			"attributeValue": 1224
		},
		{
			"attributeName": "sample3",
			"attributeValue": 1224
		},
		{
			"attributeName": "sample4",
			"attributeValue": 1224
		}
		
	},
"attributeName": null,
"attributeValue": 0.00
}

This has to be done in dataframes

I was trying with dataframes i exploded submissions.submissionOffsets , then checked for attribute name and value, but this giving one column, i have to join that back to original dataframe.

答案1

得分: 0

filter是一个高阶函数,用于从嵌套的数组JSON中过滤特定属性。

inlineinline_outer用于展开数组值。

以下是示例代码:

val df = spark
  .read
  .option("multiLine", "true")
  .json(Seq(data).toDS)

df.show(false)
+---------+----------------------------------------------------------------------+
|positions|submissions                                                           |
+---------+----------------------------------------------------------------------+
|{abc}    |{[{sample1, 1224}, {sample2, 1224}, {sample3, 1224}, {sample4, 1224}]}|
+---------+----------------------------------------------------------------------+

df.printSchema
root
 |-- positions: struct (nullable = true)
 |    |-- node: string (nullable = true)
 |-- submissions: struct (nullable = true)
 |    |-- submissionOffsets: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- attributeName: string (nullable = true)
 |    |    |    |-- attributeValue: long (nullable = true)

// 使用filter高阶函数过滤嵌套的JSON数组值/属性
// inline_outer用于展开数组值

df
  .selectExpr(
    "*", // 选择数据集/数据框中的所有列
    "inline_outer(filter(submissions.submissionOffsets, i -> i.attributeName == 'sample1')) as (attributeName, attributeValue)"
  )
  .toJSON
  .show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"positions":{"node":"abc"},"submissions":{"submissionOffsets":[{"attributeName":"sample1","attributeValue":1224},{"attributeName":"sample2","attributeValue":1224},{"attributeName":"sample3","attributeValue":1224},{"attributeName":"sample4","attributeValue":1224}]},"attributeName":"sample1","attributeValue":1224}|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

以上是示例代码的翻译结果。

英文:

filter higher order function to filter specific attribute from nested array json.

inline or inline_outer - to explode array values.

Below is sample code

scala> val df = spark
.read
.option("multiLine", "true")
.json(Seq(data).toDS)
df: org.apache.spark.sql.DataFrame = [positions: struct<node: string>, submissions: struct<submissionOffsets: array<struct<attributeName:string,attributeValue:bigint>>>]
scala> df.show(false)
+---------+----------------------------------------------------------------------+
|positions|submissions                                                           |
+---------+----------------------------------------------------------------------+
|{abc}    |{[{sample1, 1224}, {sample2, 1224}, {sample3, 1224}, {sample4, 1224}]}|
+---------+----------------------------------------------------------------------+
scala> df.printSchema
root
|-- positions: struct (nullable = true)
|    |-- node: string (nullable = true)
|-- submissions: struct (nullable = true)
|    |-- submissionOffsets: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- attributeName: string (nullable = true)
|    |    |    |-- attributeValue: long (nullable = true)
scala> 
// filter higher order function to filter nested json array values / attributes
// inline_outer is to explode array values
df
.selectExpr(
"*", // to select all columns from the dataset / dataframe
"inline_outer(filter(submissions.submissionOffsets, i -> i.attributeName == 'sample1')) as (attributeName, attributeValue)"
)
.toJSON.show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"positions":{"node":"abc"},"submissions":{"submissionOffsets":[{"attributeName":"sample1","attributeValue":1224},{"attributeName":"sample2","attributeValue":1224},{"attributeName":"sample3","attributeValue":1224},{"attributeName":"sample4","attributeValue":1224}]},"attributeName":"sample1","attributeValue":1224}|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

huangapple
  • 本文由 发表于 2023年8月8日 20:21:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/76859520.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定