将JSON数据规范化为Pandas DataFrame,其中列和值都以列表形式存在

huangapple go评论111阅读模式
英文:

Normalize JSON data to Pandas DataFrame where columns and values are in lists

问题

以下是翻译好的部分:

我有以下复杂的JSON数据需要将其规范化为Pandas DataFrame

{
    "notice": {
        "copyright": "Copyright Info",
        "copyright_url": "This_is_URL"
    },
    "product_info": {
        "refresh message": "Issued at 09:36 UTC Saturday 01/10/22 October 2022",
        "issuance_time": "20221001T0936Z",
        "product_name": "24hr historic data",
        "ID": "XXXXYYYYZZZZ"
    },
    "data": [
        {
            "station_info": {
                "wmo_id": 95214,
                "station_name": "STATION 1",
                "station_height": 3.8,
                "station_type": "AWS"
            },
            "observation_data": {
                "columns": [
                    "temp",
                    "dewt",
                    "rh",
                    "wnd_spd_kmh"
                ],
                "data": [
                    [
                        27.2,
                        23.7,
                        81.0,
                        26.0
                    ],
                    [
                        27.3,
                        23.5,
                        80.0,
                        28.0
                    ],
                    [
                        27.4,
                        23.2,
                        78.0,
                        30.0
                    ]
                ]
            }
        },
        {
            "station_info": {
                "wmo_id": 95215,
                "station_name": "STATION 2",
                "station_height": 3.5,
                "station_type": "AWS"
            },
            "observation_data": {
                "columns": [
                    "temp",
                    "dewt",
                    "rh",
                    "wnd_spd_kmh"
                ],
                "data": [
                    [
                        24.2,
                        25.7,
                        82.1,
                        21.0
                    ],
                    [
                        24.3,
                        25.8,
                        79.6,
                        22.0
                    ],
                    [
                        24.4,
                        25.9,
                        78.3,
                        16.0
                    ]
                ]
            }
        }
    ]
}

期望的DataFrame的列在JSON中以"columns"的形式存储实际数据也是如此

期望的输出如下所示

[![enter image description here][2]][2]

我尝试过的代码如下

```python
import pandas as pd
import json
import gzip

with gzip.open(json_file_path, "r") as f:
    data = f.read()
    j = json.loads(data.decode('utf-8'))
national_data = pd.json_normalize(j['data'])

然而,整个"columns"列表被转换为单元格值。

station_info.wmo_id station_info.station_name observation_data.columns observation_data.data
0 95214 STATION 1 [temp, dewt, rh, wnd_spd_kmh] [[27.2, 23.7, 81.0, 26.0],[...],[...]]
1 95215 STATION 2 [temp, dewt, rh, wnd_spd_kmh] [[24.2, 25.7, 82.1, 21.0],[...],[...]]


[2]: https://i.stack.imgur.com/nWf93.png
英文:

I have the following complex JSON data that needs to be normalised to a Pandas DataFrame.

{
"notice": {
"copyright": "Copyright Info",
"copyright_url": "This_is_URL"
},
"product_info": {
"refresh message": "Issued at 09:36 UTC Saturday 01/10/22 October 2022",
"issuance_time": "20221001T0936Z",
"product_name": "24hr historic data",
"ID": "XXXXYYYYZZZZ"
},
"data": [
{
"station_info": {
"wmo_id": 95214,
"station_name": "STATION 1",
"station_height": 3.8,
"station_type": "AWS"
},
"observation_data": {
"columns": [
"temp",
"dewt",
"rh",
"wnd_spd_kmh"
],
"data": [
[
27.2,
23.7,
81.0,
26.0
],
[
27.3,
23.5,
80.0,
28.0
],
[
27.4,
23.2,
78.0,
30.0
]
]
}
},
{
"station_info": {
"wmo_id": 95215,
"station_name": "STATION 2",
"station_height": 3.5,
"station_type": "AWS"
},
"observation_data": {
"columns": [
"temp",
"dewt",
"rh",
"wnd_spd_kmh"
],
"data": [
[
24.2,
25.7,
82.1,
21.0
],
[
24.3,
25.8,
79.6,
22.0
],
[
24.4,
25.9,
78.3,
16.0
]
]
}
}
]
}

The columns of the expected DataFrame is in a list in the JSON as "columns". So are the actual "data".

What is expected to output is as below:

将JSON数据规范化为Pandas DataFrame,其中列和值都以列表形式存在

What I have attempted:

with gzip.open(json_file_path, "r") as f:
data = f.read()
j = json.loads (data.decode('utf-8'))
national_data = pd.json_normalize(j['data'])

However, the whole "columns" list was converted to a cell value.

        station_info.wmo_id    station_info.station_name         observation_data.columns         observation_data.data
0   95214                  STATION 1                         [temp, dewt, rh, wnd_spd_kmh]     [[27.2, 23.7, 81.0, 26.0],[...],[...]]
1   95215                  STATION 2                         [temp, dewt, rh, wnd_spd_kmh]     [[24.2, 25.7, 82.1, 21.0],[...],[...]]

答案1

得分: 1

我不认为仅使用json_normalize能够实现你想要的输出。你可以通过添加explode(来展开json_normalize生成的数组)和pivot(将数据框从长格式转换为宽格式)来实现这个目标。

第一步是展开observation_data.data并重置索引(使用explode和reset_index),以便知道或组织记录组成一个观测。从这里,我们可以再次执行另一个explode(包括列和数据)如下所示。

最后一步是使用第一个explode的索引作为唯一锚点来将结果数据框进行透视,从长格式转换为宽格式。

data_exploded = national_data.explode('observation_data.data').reset_index(drop=True).explode(['observation_data.columns','observation_data.data']).reset_index()
data_exploded

透视后的数据框如下:

data_exploded.pivot(index=['index', 'station_info.wmo_id', 'station_info.station_name'], columns='observation_data.columns', values='observation_data.data').reset_index().drop(columns=['index'])

希望这对你有所帮助。

英文:

I don't think json_normalize alone will be able to achieve the output you wanted. You can achieve this by adding explode (to expand the arrays generated by json_normalize) and pivot (to transform the dataframe from long to wide format).

First step is to flatten the observation_data.data and reset the index (using explode and reset_index) to be able to know or organize group of records as one observation. From here we can again perform another explode (including both the columns and data) as shown below.

Last step now is to pivot the resulting dataframe using the index of the first explode as our unique anchor to transform from long to wide format.

>>> data_exploded = national_data.explode('observation_data.data').reset_index(drop=True).explode(['observation_data.columns','observation_data.data']).reset_index()
>>> data_exploded
index  station_info.wmo_id station_info.station_name observation_data.columns observation_data.data
0       0                95214                 STATION 1                     temp                  27.2
1       0                95214                 STATION 1                     dewt                  23.7
2       0                95214                 STATION 1                       rh                  81.0
3       0                95214                 STATION 1              wnd_spd_kmh                  26.0
4       1                95214                 STATION 1                     temp                  27.3
5       1                95214                 STATION 1                     dewt                  23.5
6       1                95214                 STATION 1                       rh                  80.0
7       1                95214                 STATION 1              wnd_spd_kmh                  28.0
8       2                95214                 STATION 1                     temp                  27.4
9       2                95214                 STATION 1                     dewt                  23.2
10      2                95214                 STATION 1                       rh                  78.0
11      2                95214                 STATION 1              wnd_spd_kmh                  30.0
12      3                95215                 STATION 2                     temp                  24.2
13      3                95215                 STATION 2                     dewt                  25.7
14      3                95215                 STATION 2                       rh                  82.1
15      3                95215                 STATION 2              wnd_spd_kmh                  21.0
16      4                95215                 STATION 2                     temp                  24.3
17      4                95215                 STATION 2                     dewt                  25.8
18      4                95215                 STATION 2                       rh                  79.6
19      4                95215                 STATION 2              wnd_spd_kmh                  22.0
20      5                95215                 STATION 2                     temp                  24.4
21      5                95215                 STATION 2                     dewt                  25.9
22      5                95215                 STATION 2                       rh                  78.3
23      5                95215                 STATION 2              wnd_spd_kmh                  16.0
>>> data_exploded.pivot(index=['index', 'station_info.wmo_id', 'station_info.station_name'],columns='observation_data.columns',values='observation_data.data').reset_index().drop(columns=['index'])
observation_data.columns  station_info.wmo_id station_info.station_name  dewt    rh  temp wnd_spd_kmh
0                                       95214                 STATION 1  23.7  81.0  27.2        26.0
1                                       95214                 STATION 1  23.5  80.0  27.3        28.0
2                                       95214                 STATION 1  23.2  78.0  27.4        30.0
3                                       95215                 STATION 2  25.7  82.1  24.2        21.0
4                                       95215                 STATION 2  25.8  79.6  24.3        22.0
5                                       95215                 STATION 2  25.9  78.3  24.4        16.0

huangapple
  • 本文由 发表于 2023年7月13日 22:31:11
  • 转载请务必保留本文链接:https://go.coder-hub.com/76680549.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定