2023-07-13 22:31:11

Normalize JSON data to Pandas DataFrame where columns and values are in lists

Question

I have the following complex JSON data that needs to be normalized into a Pandas DataFrame.

```json
{
    "notice": {
        "copyright": "Copyright Info",
        "copyright_url": "This_is_URL"
    },
    "product_info": {
        "refresh message": "Issued at 09:36 UTC Saturday 01/10/22 October 2022",
        "issuance_time": "20221001T0936Z",
        "product_name": "24hr historic data",
        "ID": "XXXXYYYYZZZZ"
    },
    "data": [
        {
            "station_info": {
                "wmo_id": 95214,
                "station_name": "STATION 1",
                "station_height": 3.8,
                "station_type": "AWS"
            },
            "observation_data": {
                "columns": [
                    "temp",
                    "dewt",
                    "rh",
                    "wnd_spd_kmh"
                ],
                "data": [
                    [
                        27.2,
                        23.7,
                        81.0,
                        26.0
                    ],
                    [
                        27.3,
                        23.5,
                        80.0,
                        28.0
                    ],
                    [
                        27.4,
                        23.2,
                        78.0,
                        30.0
                    ]
                ]
            }
        },
        {
            "station_info": {
                "wmo_id": 95215,
                "station_name": "STATION 2",
                "station_height": 3.5,
                "station_type": "AWS"
            },
            "observation_data": {
                "columns": [
                    "temp",
                    "dewt",
                    "rh",
                    "wnd_spd_kmh"
                ],
                "data": [
                    [
                        24.2,
                        25.7,
                        82.1,
                        21.0
                    ],
                    [
                        24.3,
                        25.8,
                        79.6,
                        22.0
                    ],
                    [
                        24.4,
                        25.9,
                        78.3,
                        16.0
                    ]
                ]
            }
        }
    ]
}
```

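The column names of the expected DataFrame are stored in the JSON as a list under "columns", and the actual values are stored as a list of rows under "data".

The expected output is shown below:

[![enter image description here][2]][2]

What I have tried:
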
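```python
import gzip
import json

import pandas as pd

# json_file_path points to the gzip-compressed JSON file
with gzip.open(json_file_path, "rt", encoding="utf-8") as f:
    j = json.load(f)

# Flatten each entry of "data" into one row per station
national_data = pd.json_normalize(j["data"])
```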

However, the whole "columns" list ended up as a single cell value, and so did the nested "data" list:

```text
   station_info.wmo_id station_info.station_name       observation_data.columns                   observation_data.data
0                95214                 STATION 1  [temp, dewt, rh, wnd_spd_kmh]  [[27.2, 23.7, 81.0, 26.0],[...],[...]]
1                95215                 STATION 2  [temp, dewt, rh, wnd_spd_kmh]  [[24.2, 25.7, 82.1, 21.0],[...],[...]]
```


[2]: https://i.stack.imgur.com/nWf93.png


Answer 1

Score: 1

I don't think `json_normalize` alone can produce the output you want. You can get there by adding `explode` (to expand the lists that `json_normalize` leaves in the cells) and `pivot` (to reshape the DataFrame from long to wide format).

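The first step is to explode `observation_data.data` and reset the index (using `explode` and `reset_index`), so that each row holds exactly one observation and the new index identifies it. From there, a second `explode` over both `observation_data.columns` and `observation_data.data` pairs each column name with its value, as shown in the session below. Passing a list of columns to `explode` requires pandas 1.3 or later.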
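
To make the second `explode` easier to follow, here is a minimal sketch on a toy frame. The column names `station`, `names`, and `vals` and the values are made up for illustration; the only assumption about pandas is that multi-column `explode` is available (1.3 or later).

```python
import pandas as pd

# Toy frame (hypothetical data): each row holds one observation, with the
# measurement names and the measured values stored as parallel lists.
df = pd.DataFrame({
    "station": ["A", "B"],
    "names": [["temp", "rh"], ["temp", "rh"]],
    "vals": [[27.2, 81.0], [24.2, 82.1]],
})

# Exploding both columns in one call keeps every name paired with its value.
# The two lists in a row must have the same length.
long_df = df.explode(["names", "vals"]).reset_index()

# The "index" column now records which original row each pair came from,
# which is what the later pivot uses as its anchor.
print(long_df)
```

Exploding the two columns one after the other would not work here, because each `explode` call multiplies the rows independently instead of walking the lists in lockstep.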

The last step is to pivot the resulting DataFrame, using the index from the first `explode` as the unique anchor, to go from long to wide format.

```python
>>> data_exploded = national_data.explode('observation_data.data').reset_index(drop=True).explode(['observation_data.columns', 'observation_data.data']).reset_index()
>>> data_exploded
    index  station_info.wmo_id station_info.station_name observation_data.columns observation_data.data
0       0                95214                 STATION 1                     temp                  27.2
1       0                95214                 STATION 1                     dewt                  23.7
2       0                95214                 STATION 1                       rh                  81.0
3       0                95214                 STATION 1              wnd_spd_kmh                  26.0
4       1                95214                 STATION 1                     temp                  27.3
5       1                95214                 STATION 1                     dewt                  23.5
6       1                95214                 STATION 1                       rh                  80.0
7       1                95214                 STATION 1              wnd_spd_kmh                  28.0
8       2                95214                 STATION 1                     temp                  27.4
9       2                95214                 STATION 1                     dewt                  23.2
10      2                95214                 STATION 1                       rh                  78.0
11      2                95214                 STATION 1              wnd_spd_kmh                  30.0
12      3                95215                 STATION 2                     temp                  24.2
13      3                95215                 STATION 2                     dewt                  25.7
14      3                95215                 STATION 2                       rh                  82.1
15      3                95215                 STATION 2              wnd_spd_kmh                  21.0
16      4                95215                 STATION 2                     temp                  24.3
17      4                95215                 STATION 2                     dewt                  25.8
18      4                95215                 STATION 2                       rh                  79.6
19      4                95215                 STATION 2              wnd_spd_kmh                  22.0
20      5                95215                 STATION 2                     temp                  24.4
21      5                95215                 STATION 2                     dewt                  25.9
22      5                95215                 STATION 2                       rh                  78.3
23      5                95215                 STATION 2              wnd_spd_kmh                  16.0
>>> data_exploded.pivot(index=['index', 'station_info.wmo_id', 'station_info.station_name'], columns='observation_data.columns', values='observation_data.data').reset_index().drop(columns=['index'])
observation_data.columns  station_info.wmo_id station_info.station_name  dewt    rh  temp wnd_spd_kmh
0                                       95214                 STATION 1  23.7  81.0  27.2        26.0
1                                       95214                 STATION 1  23.5  80.0  27.3        28.0
2                                       95214                 STATION 1  23.2  78.0  27.4        30.0
3                                       95215                 STATION 2  25.7  82.1  24.2        21.0
4                                       95215                 STATION 2  25.8  79.6  24.3        22.0
5                                       95215                 STATION 2  25.9  78.3  24.4        16.0
```
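
As a possible alternative to the approach above (not part of the original answer), the table can also be built straight from the parsed JSON, without `json_normalize`: each station's "data" rows and "columns" header go into the `DataFrame` constructor, and the station metadata is added afterwards. The sketch below uses a shortened inline copy of the question's payload so that it runs on its own; the names `frames`, `frame`, and `result` are just illustrative.

```python
import pandas as pd

# Shortened copy of the question's payload (two stations, two rows each).
j = {
    "data": [
        {
            "station_info": {"wmo_id": 95214, "station_name": "STATION 1"},
            "observation_data": {
                "columns": ["temp", "dewt", "rh", "wnd_spd_kmh"],
                "data": [[27.2, 23.7, 81.0, 26.0], [27.3, 23.5, 80.0, 28.0]],
            },
        },
        {
            "station_info": {"wmo_id": 95215, "station_name": "STATION 2"},
            "observation_data": {
                "columns": ["temp", "dewt", "rh", "wnd_spd_kmh"],
                "data": [[24.2, 25.7, 82.1, 21.0], [24.3, 25.8, 79.6, 22.0]],
            },
        },
    ]
}

frames = []
for station in j["data"]:
    obs = station["observation_data"]
    # Each inner list becomes one row; "columns" supplies the header.
    frame = pd.DataFrame(obs["data"], columns=obs["columns"])
    # Put the station metadata in front; scalars are broadcast to every row.
    for pos, (key, value) in enumerate(station["station_info"].items()):
        frame.insert(pos, key, value)
    frames.append(frame)

# Stack the per-station frames into one table with a fresh 0..n-1 index.
result = pd.concat(frames, ignore_index=True)
print(result)
```

Compared with the explode-and-pivot route, this keeps the measurement columns in their original order and builds them directly from numbers, whereas exploded list cells typically end up with `object` dtype. The metadata columns are named `wmo_id` and `station_name` here, without the `station_info.` prefix that `json_normalize` adds.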
