英文:
handling nested Json structure
问题
假设我们有以下的JSON结构:
{
"positions": {
"node": "abc"
},
"submissions": {
"submissionOffsets": [
{
"attributeName": "sample1",
"attributeValue": 1224
},
{
"attributeName": "sample2",
"attributeValue": 1224
},
{
"attributeName": "sample3",
"attributeValue": 1224
},
{
"attributeName": "sample4",
"attributeValue": 1224
}
]
}
}
我们想要读取"submissionOffsets",并根据属性名(例如"sample1")提取attributeName和attributeValue。期望的结构如下所示:
匹配的情况下
{
"positions": {
"node": "abc"
},
"submissions": {
"submissionOffsets": [
{
"attributeName": "sample1",
"attributeValue": 1224
},
{
"attributeName": "sample2",
"attributeValue": 1224
},
{
"attributeName": "sample3",
"attributeValue": 1224
},
{
"attributeName": "sample4",
"attributeValue": 1224
}
]
},
"attributeName": "sample1",
"attributeValue": 1224
}
不匹配的情况下
{
"positions": {
"node": "abc"
},
"submissions": {
"submissionOffsets": [
{
"attributeName": "sample1",
"attributeValue": 1224
},
{
"attributeName": "sample2",
"attributeValue": 1224
},
{
"attributeName": "sample3",
"attributeValue": 1224
},
{
"attributeName": "sample4",
"attributeValue": 1224
}
]
},
"attributeName": null,
"attributeValue": 0.00
}
这需要在数据框中完成。
我尝试使用数据框(dataframes),我展开了submissions.submissionOffsets,然后检查属性名和值,但这只给出了一列,我需要将其与原始数据框连接起来。
英文:
suppose we have following json structure :
{
"positions": {
"node": "abc"
}
"submissions" :{
"submissionOffsets":[
{
"attributeName": "sample1",
"attributeValue": 1224
},
{
"attributeName": "sample2",
"attributeValue": 1224
},
{
"attributeName": "sample3",
"attributeValue": 1224
},
{
"attributeName": "sample4",
"attributeValue": 1224
}
}
}
and we want to read "submissionOffsets" and extract attributeName and attributeValue based on attribute name for example "sample1" and expected structure should be in case of Match
{
"positions": {
"node": "abc"
}
"submissions" :{
"submissionOffsets":[
{
"attributeName": "sample1",
"attributeValue": 1224
},
{
"attributeName": "sample2",
"attributeValue": 1224
},
{
"attributeName": "sample3",
"attributeValue": 1224
},
{
"attributeName": "sample4",
"attributeValue": 1224
}
},
"attributeName": "sample1",
"attributeValue": 1224
}
**Incase of No Match**
{
"positions": {
"node": "abc"
}
"submissions" :{
"submissionOffsets":[
{
"attributeName": "sample1",
"attributeValue": 1224
},
{
"attributeName": "sample2",
"attributeValue": 1224
},
{
"attributeName": "sample3",
"attributeValue": 1224
},
{
"attributeName": "sample4",
"attributeValue": 1224
}
},
"attributeName": null,
"attributeValue": 0.00
}
This has to be done in dataframes
I was trying with dataframes i exploded submissions.submissionOffsets , then checked for attribute name and value, but this giving one column, i have to join that back to original dataframe.
答案1
得分: 0
filter
是一个高阶函数,用于从嵌套的数组JSON中过滤特定属性。
inline
或inline_outer
用于展开数组值。
以下是示例代码:
val df = spark
.read
.option("multiLine", "true")
.json(Seq(data).toDS)
df.show(false)
+---------+----------------------------------------------------------------------+
|positions|submissions |
+---------+----------------------------------------------------------------------+
|{abc} |{[{sample1, 1224}, {sample2, 1224}, {sample3, 1224}, {sample4, 1224}]}|
+---------+----------------------------------------------------------------------+
df.printSchema
root
|-- positions: struct (nullable = true)
| |-- node: string (nullable = true)
|-- submissions: struct (nullable = true)
| |-- submissionOffsets: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- attributeName: string (nullable = true)
| | | |-- attributeValue: long (nullable = true)
// 使用filter高阶函数过滤嵌套的JSON数组值/属性
// inline_outer用于展开数组值
df
.selectExpr(
"*", // 选择数据集/数据框中的所有列
"inline_outer(filter(submissions.submissionOffsets, i -> i.attributeName == 'sample1')) as (attributeName, attributeValue)"
)
.toJSON
.show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"positions":{"node":"abc"},"submissions":{"submissionOffsets":[{"attributeName":"sample1","attributeValue":1224},{"attributeName":"sample2","attributeValue":1224},{"attributeName":"sample3","attributeValue":1224},{"attributeName":"sample4","attributeValue":1224}]},"attributeName":"sample1","attributeValue":1224}|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
以上是示例代码的翻译结果。
英文:
filter
higher order function to filter specific attribute from nested array json.
inline
or inline_outer
- to explode array values.
Below is sample code
scala> val df = spark
.read
.option("multiLine", "true")
.json(Seq(data).toDS)
df: org.apache.spark.sql.DataFrame = [positions: struct<node: string>, submissions: struct<submissionOffsets: array<struct<attributeName:string,attributeValue:bigint>>>]
scala> df.show(false)
+---------+----------------------------------------------------------------------+
|positions|submissions |
+---------+----------------------------------------------------------------------+
|{abc} |{[{sample1, 1224}, {sample2, 1224}, {sample3, 1224}, {sample4, 1224}]}|
+---------+----------------------------------------------------------------------+
scala> df.printSchema
root
|-- positions: struct (nullable = true)
| |-- node: string (nullable = true)
|-- submissions: struct (nullable = true)
| |-- submissionOffsets: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- attributeName: string (nullable = true)
| | | |-- attributeValue: long (nullable = true)
scala>
// filter higher order function to filter nested json array values / attributes
// inline_outer is to explode array values
df
.selectExpr(
"*", // to select all columns from the dataset / dataframe
"inline_outer(filter(submissions.submissionOffsets, i -> i.attributeName == 'sample1')) as (attributeName, attributeValue)"
)
.toJSON.show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"positions":{"node":"abc"},"submissions":{"submissionOffsets":[{"attributeName":"sample1","attributeValue":1224},{"attributeName":"sample2","attributeValue":1224},{"attributeName":"sample3","attributeValue":1224},{"attributeName":"sample4","attributeValue":1224}]},"attributeName":"sample1","attributeValue":1224}|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论