2023年6月1日 22:18:36go评论84阅读模式

英文:

Partially flatten nested JSON and pivot longer

问题

我有许多具有以下结构的JSON文件：

{
  "requestId": "test",
  "executionDate": "2023-05-10",
  "executionTime": "12:02:22",
  "request": {
    "fields": [{
      "geometry": {
        "type": "Point",
        "coordinates": [-90, 41]
      },
      "colour": "blue",
      "bean": "blaCk",
      "birthday": "2021-01-01",
      "arst": "111",
      "arstg": "rst",
      "fct": {
        "start": "2011-01-10",
        "end": "2012-01-10"
      }
    }]
  },
  "response": {
    "results": [{
      "geom": {
        "type": "geo",
        "coord": [-90, 41]
      },
      "md": {
        "type": "arstat",
        "mdl": "trstr",
        "vs": "v0",
        "cal": {
          "num": 4,
          "comment": "message"
        },
        "bean": ["blue", "green"],
        "result_time": 12342
      },
      "predictions": [{
        "date": "2004-05-19",
        "day": 0,
        "count": 0,
        "eating_stage": "trt"
      }, {
        "date": "2002-01-20",
        "day": 1,
        "count": 0,
        "eating_stage": "arstg"
      }, {
        "date": "2004-05-21",
        "day": 2,
        "count": 0,
        "eating_stage": "strg"
      }, {
        "date": "2004-05-22",
        "day": 3,
        "count": 0,
        "eating_stage": "rst"
      }]
    }]
  }
}

预测部分可能会非常深。我想将此JSON转换为以下结构的CSV：

requestId|executionDate|executionTime|colour|predictions_date|predictions_day|predictions_count|predictions_eating_stage
---|---|---|---|---|---|---|---
test|2023-05-10|12:02:22|blue|2004-05-19|0|0|trt
test|2023-05-10|12:02:22|blue|2002-01-20|1|0|arstg
test|2023-05-10|12:02:22|blue|2004-05-21|2|0|strg
test|2023-05-10|12:02:22|blue|2004-05-22|3|0|rst

我尝试了以下代码：

flat_json = pd.DataFrame(
    flatten(json_data), index=[0]
)

该代码导致每个数据点都变成了一列，我不确定如何在Python中使用JSON函数在'predictions'键处更长时间地进行旋转。我意识到在这个阶段，我可以使用列名进行更长时间地旋转，但我觉得有一种更简洁的方法来实现这一点。

英文:

I have many JSON files with the following structure:

{
  &quot;requestId&quot;: &quot;test&quot;,
  &quot;executionDate&quot;: &quot;2023-05-10&quot;,
  &quot;executionTime&quot;: &quot;12:02:22&quot;,
  &quot;request&quot;: {
    &quot;fields&quot;: [{
      &quot;geometry&quot;: {
        &quot;type&quot;: &quot;Point&quot;,
        &quot;coordinates&quot;: [-90, 41]
      },
      &quot;colour&quot;: &quot;blue&quot;,
      &quot;bean&quot;: &quot;blaCk&quot;,
      &quot;birthday&quot;: &quot;2021-01-01&quot;,
      &quot;arst&quot;: &quot;111&quot;,
      &quot;arstg&quot;: &quot;rst&quot;,
      &quot;fct&quot;: {
        &quot;start&quot;: &quot;2011-01-10&quot;,
        &quot;end&quot;: &quot;2012-01-10&quot;
      }
    }]
  },
  &quot;response&quot;: {
    &quot;results&quot;: [{
        &quot;geom&quot;: {
          &quot;type&quot;: &quot;geo&quot;,
          &quot;coord&quot;: [-90, 41]
        },
        &quot;md&quot;: {
          &quot;type&quot;: &quot;arstat&quot;,
          &quot;mdl&quot;: &quot;trstr&quot;,
          &quot;vs&quot;: &quot;v0&quot;,
          &quot;cal&quot;: {
            &quot;num&quot;: 4,
            &quot;comment&quot;: &quot;message&quot;
          },
          &quot;bean&quot;: [&quot;blue&quot;, &quot;green&quot;],
          &quot;result_time&quot;: 12342
        },
        &quot;predictions&quot;: [{
            &quot;date&quot;: &quot;2004-05-19&quot;,
            &quot;day&quot;: 0,
            &quot;count&quot;: 0,
            &quot;eating_stage&quot;: &quot;trt&quot;
          }, {
            &quot;date&quot;: &quot;2002-01-20&quot;,
            &quot;day&quot;: 1,
            &quot;count&quot;: 0,
            &quot;eating_stage&quot;: &quot;arstg&quot;
          }, {
            &quot;date&quot;: &quot;2004-05-21&quot;,
            &quot;day&quot;: 2,
            &quot;count&quot;: 0,
            &quot;eating_stage&quot;: &quot;strg&quot;
          }, {
            &quot;date&quot;: &quot;2004-05-22&quot;,
            &quot;day&quot;: 3,
            &quot;count&quot;: 0,
            &quot;eating_stage&quot;: &quot;rst&quot;
          }
        }
      }
    }

The predictions part can be very deep. I want to convert this JSON to a CSV with the following structure:

requestId	executionDate	executionTime	colour	predictions_date	predictions_day	predictions_eating_stage
test	2023-05-10	12:02:22	blue	2004-05-19	0	trt
test	2023-05-10	12:02:22	blue	2002-01-20	1	astrg
test	2023-05-10	12:02:22	blue	2004-05-21	2	strg
test	2023-05-10	12:02:22	blue	2004-05-22	3	rst

I tried the following code:

flat_json = pd.DataFrame(
    flatten(json_data), index=[0]
)

The code results in every data point becoming a column, and I am not sure how to pivot longer where at the 'predictions' key using JSON functions in Python. I recognise that at this stage I could pivot longer using column names, but I feel like there is a cleaner way to achieve this.

答案1

得分: 1

我建议简单地提取您所需的内容。这似乎非常具体，不需要使用特定的解析方法来解决。因此，我建议首先创建两个数据框：

df_prediction = pd.DataFrame(example['response']['results'][0]['predictions'])
df_data = pd.DataFrame({x: y for x, y in example.items() if type(y) == str}, index=[0])

重命名预测列：

df_prediction.columns = ['prediction_' + x for x in df_prediction]

连接并添加最后的数据部分（颜色）：

output = df_data.assign(colour=example['request']['fields'][0]['colour']).join(df_prediction, how='right').ffill()

输出结果：

  requestId executionDate  ... prediction_count prediction_eating_stage
0      test    2023-05-10  ...                0                     trt
1      test    2023-05-10  ...                0                   arstg
2      test    2023-05-10  ...                0                    strg
3      test    2023-05-10  ...                0                     rst

请注意，这些代码中使用了Python的pandas库，用于数据处理和操作。如果您需要进一步的帮助或解释，请随时告诉我。

英文:

I would suggest simply extracting what you need. It seems very specific for it to be solved using specific parsing. Therefore I would start by creating two dataframes:

df_prediction = pd.DataFrame(example[&#39;response&#39;][&#39;results&#39;][0][&#39;predictions&#39;])
df_data = pd.DataFrame({x:y for x,y in example.items() if type(y)==str},index=[0])

Renaming columns in predictions:

df_prediction.columns = [&#39;prediction_&#39;+x for x in df_prediction]

Joining and adding the last piece of data (colour):

output = df_data.assign(colour = example[&#39;request&#39;][&#39;fields&#39;][0][&#39;colour&#39;]).join(df_prediction,how=&#39;right&#39;).ffill()

Outputting:

  requestId executionDate  ... prediction_count prediction_eating_stage
0      test    2023-05-10  ...                0                     trt
1      test    2023-05-10  ...                0                   arstg
2      test    2023-05-10  ...                0                    strg
3      test    2023-05-10  ...                0                     rst

答案2

得分: 1

你也可以使用json_normalize来提取你想要规范化为csv的记录数组。

df_predictions = pd.json_normalize(json_data, record_path=['response', 'results', 'predictions'], record_prefix='predictions.', meta=['requestId', 'executionDate', 'executionTime']).assign(colour=json_data['request']['fields'][0]['colour'])
df_predictions

不幸的是，meta字段存在限制，因为它对包含数组/列表的路径抛出异常，所以“colour”列是单独添加的。如果顺序很重要，那么你可以根据需要重新排列列。

英文:

You can also use json_normalize to extract the array of records that you want to normalize into a csv.

&gt;&gt;&gt; df_predictions = pd.json_normalize(json_data,record_path=[&#39;response&#39;, &#39;results&#39;,&#39;predictions&#39;], record_prefix=&#39;predictions.&#39;, meta=[&#39;requestId&#39;, &#39;executionDate&#39;, &#39;executionTime&#39;]).assign(colour = json_data[&#39;request&#39;][&#39;fields&#39;][0][&#39;colour&#39;])
&gt;&gt;&gt; df_predictions
  predictions.date  predictions.day  predictions.count  ... executionDate executionTime colour
0       2004-05-19                0                  0  ...    2023-05-10      12:02:22   blue
1       2002-01-20                1                  0  ...    2023-05-10      12:02:22   blue
2       2004-05-21                2                  0  ...    2023-05-10      12:02:22   blue
3       2004-05-22                3                  0  ...    2023-05-10      12:02:22   blue
[4 rows x 8 columns]

It is just unfortunate that there is a limitation on the meta fields as it is throwing an exception for a path that includes an array / list so the "colour" column was added separately. If the order is important, then you can rearrange the columns as needed.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

部分展平嵌套的JSON并进行长格式旋转

问题

答案1

答案2

Pandas中特定列数值的3周滚动平均值

Convert CURL POST without body to RESTRequest Delphi Code / Write Delphi Code for Dropbox API /get_current_account endpoint

两个具有相同名称的字段在pydantic模型类中引发问题。

Python类声明自动根据变量名设置名称。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。