部分展平嵌套的JSON并进行长格式旋转

huangapple go评论84阅读模式
英文:

Partially flatten nested JSON and pivot longer

问题

我有许多具有以下结构的JSON文件:

  1. {
  2. "requestId": "test",
  3. "executionDate": "2023-05-10",
  4. "executionTime": "12:02:22",
  5. "request": {
  6. "fields": [{
  7. "geometry": {
  8. "type": "Point",
  9. "coordinates": [-90, 41]
  10. },
  11. "colour": "blue",
  12. "bean": "blaCk",
  13. "birthday": "2021-01-01",
  14. "arst": "111",
  15. "arstg": "rst",
  16. "fct": {
  17. "start": "2011-01-10",
  18. "end": "2012-01-10"
  19. }
  20. }]
  21. },
  22. "response": {
  23. "results": [{
  24. "geom": {
  25. "type": "geo",
  26. "coord": [-90, 41]
  27. },
  28. "md": {
  29. "type": "arstat",
  30. "mdl": "trstr",
  31. "vs": "v0",
  32. "cal": {
  33. "num": 4,
  34. "comment": "message"
  35. },
  36. "bean": ["blue", "green"],
  37. "result_time": 12342
  38. },
  39. "predictions": [{
  40. "date": "2004-05-19",
  41. "day": 0,
  42. "count": 0,
  43. "eating_stage": "trt"
  44. }, {
  45. "date": "2002-01-20",
  46. "day": 1,
  47. "count": 0,
  48. "eating_stage": "arstg"
  49. }, {
  50. "date": "2004-05-21",
  51. "day": 2,
  52. "count": 0,
  53. "eating_stage": "strg"
  54. }, {
  55. "date": "2004-05-22",
  56. "day": 3,
  57. "count": 0,
  58. "eating_stage": "rst"
  59. }]
  60. }]
  61. }
  62. }

预测部分可能会非常深。我想将此JSON转换为以下结构的CSV:

  1. requestId|executionDate|executionTime|colour|predictions_date|predictions_day|predictions_count|predictions_eating_stage
  2. ---|---|---|---|---|---|---|---
  3. test|2023-05-10|12:02:22|blue|2004-05-19|0|0|trt
  4. test|2023-05-10|12:02:22|blue|2002-01-20|1|0|arstg
  5. test|2023-05-10|12:02:22|blue|2004-05-21|2|0|strg
  6. test|2023-05-10|12:02:22|blue|2004-05-22|3|0|rst

我尝试了以下代码:

  1. flat_json = pd.DataFrame(
  2. flatten(json_data), index=[0]
  3. )

该代码导致每个数据点都变成了一列,我不确定如何在Python中使用JSON函数在'predictions'键处更长时间地进行旋转。我意识到在这个阶段,我可以使用列名进行更长时间地旋转,但我觉得有一种更简洁的方法来实现这一点。

英文:

I have many JSON files with the following structure:

<!-- begin snippet: js hide: false console: true babel: false -->

<!-- language: lang-js -->

  1. {
  2. &quot;requestId&quot;: &quot;test&quot;,
  3. &quot;executionDate&quot;: &quot;2023-05-10&quot;,
  4. &quot;executionTime&quot;: &quot;12:02:22&quot;,
  5. &quot;request&quot;: {
  6. &quot;fields&quot;: [{
  7. &quot;geometry&quot;: {
  8. &quot;type&quot;: &quot;Point&quot;,
  9. &quot;coordinates&quot;: [-90, 41]
  10. },
  11. &quot;colour&quot;: &quot;blue&quot;,
  12. &quot;bean&quot;: &quot;blaCk&quot;,
  13. &quot;birthday&quot;: &quot;2021-01-01&quot;,
  14. &quot;arst&quot;: &quot;111&quot;,
  15. &quot;arstg&quot;: &quot;rst&quot;,
  16. &quot;fct&quot;: {
  17. &quot;start&quot;: &quot;2011-01-10&quot;,
  18. &quot;end&quot;: &quot;2012-01-10&quot;
  19. }
  20. }]
  21. },
  22. &quot;response&quot;: {
  23. &quot;results&quot;: [{
  24. &quot;geom&quot;: {
  25. &quot;type&quot;: &quot;geo&quot;,
  26. &quot;coord&quot;: [-90, 41]
  27. },
  28. &quot;md&quot;: {
  29. &quot;type&quot;: &quot;arstat&quot;,
  30. &quot;mdl&quot;: &quot;trstr&quot;,
  31. &quot;vs&quot;: &quot;v0&quot;,
  32. &quot;cal&quot;: {
  33. &quot;num&quot;: 4,
  34. &quot;comment&quot;: &quot;message&quot;
  35. },
  36. &quot;bean&quot;: [&quot;blue&quot;, &quot;green&quot;],
  37. &quot;result_time&quot;: 12342
  38. },
  39. &quot;predictions&quot;: [{
  40. &quot;date&quot;: &quot;2004-05-19&quot;,
  41. &quot;day&quot;: 0,
  42. &quot;count&quot;: 0,
  43. &quot;eating_stage&quot;: &quot;trt&quot;
  44. }, {
  45. &quot;date&quot;: &quot;2002-01-20&quot;,
  46. &quot;day&quot;: 1,
  47. &quot;count&quot;: 0,
  48. &quot;eating_stage&quot;: &quot;arstg&quot;
  49. }, {
  50. &quot;date&quot;: &quot;2004-05-21&quot;,
  51. &quot;day&quot;: 2,
  52. &quot;count&quot;: 0,
  53. &quot;eating_stage&quot;: &quot;strg&quot;
  54. }, {
  55. &quot;date&quot;: &quot;2004-05-22&quot;,
  56. &quot;day&quot;: 3,
  57. &quot;count&quot;: 0,
  58. &quot;eating_stage&quot;: &quot;rst&quot;
  59. }
  60. }
  61. }
  62. }

<!-- end snippet -->

The predictions part can be very deep. I want to convert this JSON to a CSV with the following structure:

requestId executionDate executionTime colour predictions_date predictions_day predictions_count predictions_eating_stage
test 2023-05-10 12:02:22 blue 2004-05-19 0 0 trt
test 2023-05-10 12:02:22 blue 2002-01-20 1 0 astrg
test 2023-05-10 12:02:22 blue 2004-05-21 2 0 strg
test 2023-05-10 12:02:22 blue 2004-05-22 3 0 rst

I tried the following code:

  1. flat_json = pd.DataFrame(
  2. flatten(json_data), index=[0]
  3. )

The code results in every data point becoming a column, and I am not sure how to pivot longer where at the 'predictions' key using JSON functions in Python. I recognise that at this stage I could pivot longer using column names, but I feel like there is a cleaner way to achieve this.

答案1

得分: 1

我建议简单地提取您所需的内容。这似乎非常具体,不需要使用特定的解析方法来解决。因此,我建议首先创建两个数据框:

  1. df_prediction = pd.DataFrame(example['response']['results'][0]['predictions'])
  2. df_data = pd.DataFrame({x: y for x, y in example.items() if type(y) == str}, index=[0])

重命名预测列:

  1. df_prediction.columns = ['prediction_' + x for x in df_prediction]

连接并添加最后的数据部分(颜色):

  1. output = df_data.assign(colour=example['request']['fields'][0]['colour']).join(df_prediction, how='right').ffill()

输出结果:

  1. requestId executionDate ... prediction_count prediction_eating_stage
  2. 0 test 2023-05-10 ... 0 trt
  3. 1 test 2023-05-10 ... 0 arstg
  4. 2 test 2023-05-10 ... 0 strg
  5. 3 test 2023-05-10 ... 0 rst

请注意,这些代码中使用了Python的pandas库,用于数据处理和操作。如果您需要进一步的帮助或解释,请随时告诉我。

英文:

I would suggest simply extracting what you need. It seems very specific for it to be solved using specific parsing. Therefore I would start by creating two dataframes:

  1. df_prediction = pd.DataFrame(example[&#39;response&#39;][&#39;results&#39;][0][&#39;predictions&#39;])
  2. df_data = pd.DataFrame({x:y for x,y in example.items() if type(y)==str},index=[0])

Renaming columns in predictions:

  1. df_prediction.columns = [&#39;prediction_&#39;+x for x in df_prediction]

Joining and adding the last piece of data (colour):

  1. output = df_data.assign(colour = example[&#39;request&#39;][&#39;fields&#39;][0][&#39;colour&#39;]).join(df_prediction,how=&#39;right&#39;).ffill()

Outputting:

  1. requestId executionDate ... prediction_count prediction_eating_stage
  2. 0 test 2023-05-10 ... 0 trt
  3. 1 test 2023-05-10 ... 0 arstg
  4. 2 test 2023-05-10 ... 0 strg
  5. 3 test 2023-05-10 ... 0 rst

答案2

得分: 1

你也可以使用json_normalize来提取你想要规范化为csv的记录数组。

  1. df_predictions = pd.json_normalize(json_data, record_path=['response', 'results', 'predictions'], record_prefix='predictions.', meta=['requestId', 'executionDate', 'executionTime']).assign(colour=json_data['request']['fields'][0]['colour'])
  2. df_predictions

不幸的是,meta字段存在限制,因为它对包含数组/列表的路径抛出异常,所以“colour”列是单独添加的。如果顺序很重要,那么你可以根据需要重新排列列。

英文:

You can also use json_normalize to extract the array of records that you want to normalize into a csv.

  1. &gt;&gt;&gt; df_predictions = pd.json_normalize(json_data,record_path=[&#39;response&#39;, &#39;results&#39;,&#39;predictions&#39;], record_prefix=&#39;predictions.&#39;, meta=[&#39;requestId&#39;, &#39;executionDate&#39;, &#39;executionTime&#39;]).assign(colour = json_data[&#39;request&#39;][&#39;fields&#39;][0][&#39;colour&#39;])
  2. &gt;&gt;&gt; df_predictions
  3. predictions.date predictions.day predictions.count ... executionDate executionTime colour
  4. 0 2004-05-19 0 0 ... 2023-05-10 12:02:22 blue
  5. 1 2002-01-20 1 0 ... 2023-05-10 12:02:22 blue
  6. 2 2004-05-21 2 0 ... 2023-05-10 12:02:22 blue
  7. 3 2004-05-22 3 0 ... 2023-05-10 12:02:22 blue
  8. [4 rows x 8 columns]

It is just unfortunate that there is a limitation on the meta fields as it is throwing an exception for a path that includes an array / list so the "colour" column was added separately. If the order is important, then you can rearrange the columns as needed.

huangapple
  • 本文由 发表于 2023年6月1日 22:18:36
  • 转载请务必保留本文链接:https://go.coder-hub.com/76382888.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定