处理嵌套的 JSON 结构

huangapple go评论118阅读模式
英文:

handling nested Json structure

问题

假设我们有以下的JSON结构:

  1. {
  2. "positions": {
  3. "node": "abc"
  4. },
  5. "submissions": {
  6. "submissionOffsets": [
  7. {
  8. "attributeName": "sample1",
  9. "attributeValue": 1224
  10. },
  11. {
  12. "attributeName": "sample2",
  13. "attributeValue": 1224
  14. },
  15. {
  16. "attributeName": "sample3",
  17. "attributeValue": 1224
  18. },
  19. {
  20. "attributeName": "sample4",
  21. "attributeValue": 1224
  22. }
  23. ]
  24. }
  25. }

我们想要读取"submissionOffsets",并根据属性名(例如"sample1")提取attributeName和attributeValue。期望的结构如下所示:
匹配的情况下

  1. {
  2. "positions": {
  3. "node": "abc"
  4. },
  5. "submissions": {
  6. "submissionOffsets": [
  7. {
  8. "attributeName": "sample1",
  9. "attributeValue": 1224
  10. },
  11. {
  12. "attributeName": "sample2",
  13. "attributeValue": 1224
  14. },
  15. {
  16. "attributeName": "sample3",
  17. "attributeValue": 1224
  18. },
  19. {
  20. "attributeName": "sample4",
  21. "attributeValue": 1224
  22. }
  23. ]
  24. },
  25. "attributeName": "sample1",
  26. "attributeValue": 1224
  27. }

不匹配的情况下

  1. {
  2. "positions": {
  3. "node": "abc"
  4. },
  5. "submissions": {
  6. "submissionOffsets": [
  7. {
  8. "attributeName": "sample1",
  9. "attributeValue": 1224
  10. },
  11. {
  12. "attributeName": "sample2",
  13. "attributeValue": 1224
  14. },
  15. {
  16. "attributeName": "sample3",
  17. "attributeValue": 1224
  18. },
  19. {
  20. "attributeName": "sample4",
  21. "attributeValue": 1224
  22. }
  23. ]
  24. },
  25. "attributeName": null,
  26. "attributeValue": 0.00
  27. }

这需要在数据框中完成。

我尝试使用数据框(dataframes),我展开了submissions.submissionOffsets,然后检查属性名和值,但这只给出了一列,我需要将其与原始数据框连接起来。

英文:

suppose we have following json structure :

  1. {
  2. "positions": {
  3. "node": "abc"
  4. }
  5. "submissions" :{
  6. "submissionOffsets":[
  7. {
  8. "attributeName": "sample1",
  9. "attributeValue": 1224
  10. },
  11. {
  12. "attributeName": "sample2",
  13. "attributeValue": 1224
  14. },
  15. {
  16. "attributeName": "sample3",
  17. "attributeValue": 1224
  18. },
  19. {
  20. "attributeName": "sample4",
  21. "attributeValue": 1224
  22. }
  23. }
  24. }

and we want to read "submissionOffsets" and extract attributeName and attributeValue based on attribute name for example "sample1" and expected structure should be in case of Match

  1. {
  2. "positions": {
  3. "node": "abc"
  4. }
  5. "submissions" :{
  6. "submissionOffsets":[
  7. {
  8. "attributeName": "sample1",
  9. "attributeValue": 1224
  10. },
  11. {
  12. "attributeName": "sample2",
  13. "attributeValue": 1224
  14. },
  15. {
  16. "attributeName": "sample3",
  17. "attributeValue": 1224
  18. },
  19. {
  20. "attributeName": "sample4",
  21. "attributeValue": 1224
  22. }
  23. },
  24. "attributeName": "sample1",
  25. "attributeValue": 1224
  26. }
  27. **Incase of No Match**
  28. {
  29. "positions": {
  30. "node": "abc"
  31. }
  32. "submissions" :{
  33. "submissionOffsets":[
  34. {
  35. "attributeName": "sample1",
  36. "attributeValue": 1224
  37. },
  38. {
  39. "attributeName": "sample2",
  40. "attributeValue": 1224
  41. },
  42. {
  43. "attributeName": "sample3",
  44. "attributeValue": 1224
  45. },
  46. {
  47. "attributeName": "sample4",
  48. "attributeValue": 1224
  49. }
  50. },
  51. "attributeName": null,
  52. "attributeValue": 0.00
  53. }

This has to be done in dataframes

I was trying with dataframes i exploded submissions.submissionOffsets , then checked for attribute name and value, but this giving one column, i have to join that back to original dataframe.

答案1

得分: 0

filter是一个高阶函数,用于从嵌套的数组JSON中过滤特定属性。

inlineinline_outer用于展开数组值。

以下是示例代码:

  1. val df = spark
  2. .read
  3. .option("multiLine", "true")
  4. .json(Seq(data).toDS)
  5. df.show(false)
  6. +---------+----------------------------------------------------------------------+
  7. |positions|submissions |
  8. +---------+----------------------------------------------------------------------+
  9. |{abc} |{[{sample1, 1224}, {sample2, 1224}, {sample3, 1224}, {sample4, 1224}]}|
  10. +---------+----------------------------------------------------------------------+
  11. df.printSchema
  12. root
  13. |-- positions: struct (nullable = true)
  14. | |-- node: string (nullable = true)
  15. |-- submissions: struct (nullable = true)
  16. | |-- submissionOffsets: array (nullable = true)
  17. | | |-- element: struct (containsNull = true)
  18. | | | |-- attributeName: string (nullable = true)
  19. | | | |-- attributeValue: long (nullable = true)
  20. // 使用filter高阶函数过滤嵌套的JSON数组值/属性
  21. // inline_outer用于展开数组值
  22. df
  23. .selectExpr(
  24. "*", // 选择数据集/数据框中的所有列
  25. "inline_outer(filter(submissions.submissionOffsets, i -> i.attributeName == 'sample1')) as (attributeName, attributeValue)"
  26. )
  27. .toJSON
  28. .show(false)
  29. +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  30. |value |
  31. +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  32. |{"positions":{"node":"abc"},"submissions":{"submissionOffsets":[{"attributeName":"sample1","attributeValue":1224},{"attributeName":"sample2","attributeValue":1224},{"attributeName":"sample3","attributeValue":1224},{"attributeName":"sample4","attributeValue":1224}]},"attributeName":"sample1","attributeValue":1224}|
  33. +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

以上是示例代码的翻译结果。

英文:

filter higher order function to filter specific attribute from nested array json.

inline or inline_outer - to explode array values.

Below is sample code

  1. scala> val df = spark
  2. .read
  3. .option("multiLine", "true")
  4. .json(Seq(data).toDS)
  5. df: org.apache.spark.sql.DataFrame = [positions: struct<node: string>, submissions: struct<submissionOffsets: array<struct<attributeName:string,attributeValue:bigint>>>]
  6. scala> df.show(false)
  7. +---------+----------------------------------------------------------------------+
  8. |positions|submissions |
  9. +---------+----------------------------------------------------------------------+
  10. |{abc} |{[{sample1, 1224}, {sample2, 1224}, {sample3, 1224}, {sample4, 1224}]}|
  11. +---------+----------------------------------------------------------------------+
  12. scala> df.printSchema
  13. root
  14. |-- positions: struct (nullable = true)
  15. | |-- node: string (nullable = true)
  16. |-- submissions: struct (nullable = true)
  17. | |-- submissionOffsets: array (nullable = true)
  18. | | |-- element: struct (containsNull = true)
  19. | | | |-- attributeName: string (nullable = true)
  20. | | | |-- attributeValue: long (nullable = true)
  21. scala>
  22. // filter higher order function to filter nested json array values / attributes
  23. // inline_outer is to explode array values
  24. df
  25. .selectExpr(
  26. "*", // to select all columns from the dataset / dataframe
  27. "inline_outer(filter(submissions.submissionOffsets, i -> i.attributeName == 'sample1')) as (attributeName, attributeValue)"
  28. )
  29. .toJSON.show(false)
  30. +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  31. |value |
  32. +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  33. |{"positions":{"node":"abc"},"submissions":{"submissionOffsets":[{"attributeName":"sample1","attributeValue":1224},{"attributeName":"sample2","attributeValue":1224},{"attributeName":"sample3","attributeValue":1224},{"attributeName":"sample4","attributeValue":1224}]},"attributeName":"sample1","attributeValue":1224}|
  34. +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

huangapple
  • 本文由 发表于 2023年8月8日 20:21:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/76859520.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定