Normalize JSON data to Pandas DataFrame where columns and values are in lists
Question
I have the following complex JSON data that needs to be normalised to a Pandas DataFrame.
```json
{
  "notice": {
    "copyright": "Copyright Info",
    "copyright_url": "This_is_URL"
  },
  "product_info": {
    "refresh message": "Issued at 09:36 UTC Saturday 01/10/22 October 2022",
    "issuance_time": "20221001T0936Z",
    "product_name": "24hr historic data",
    "ID": "XXXXYYYYZZZZ"
  },
  "data": [
    {
      "station_info": {
        "wmo_id": 95214,
        "station_name": "STATION 1",
        "station_height": 3.8,
        "station_type": "AWS"
      },
      "observation_data": {
        "columns": ["temp", "dewt", "rh", "wnd_spd_kmh"],
        "data": [
          [27.2, 23.7, 81.0, 26.0],
          [27.3, 23.5, 80.0, 28.0],
          [27.4, 23.2, 78.0, 30.0]
        ]
      }
    },
    {
      "station_info": {
        "wmo_id": 95215,
        "station_name": "STATION 2",
        "station_height": 3.5,
        "station_type": "AWS"
      },
      "observation_data": {
        "columns": ["temp", "dewt", "rh", "wnd_spd_kmh"],
        "data": [
          [24.2, 25.7, 82.1, 21.0],
          [24.3, 25.8, 79.6, 22.0],
          [24.4, 25.9, 78.3, 16.0]
        ]
      }
    }
  ]
}
```
The columns of the expected DataFrame are stored in the JSON as the "columns" list, and the actual values as the "data" list.

The expected output is shown below:

[![enter image description here](https://i.stack.imgur.com/nWf93.png)](https://i.stack.imgur.com/nWf93.png)

What I have attempted:
```python
import gzip
import json

import pandas as pd

# json_file_path points to the gzipped JSON file
with gzip.open(json_file_path, "r") as f:
    data = f.read()

j = json.loads(data.decode('utf-8'))
national_data = pd.json_normalize(j['data'])
```
However, the whole "columns" list was converted to a cell value.
```
   station_info.wmo_id station_info.station_name       observation_data.columns                     observation_data.data
0                95214                 STATION 1  [temp, dewt, rh, wnd_spd_kmh]  [[27.2, 23.7, 81.0, 26.0], [...], [...]]
1                95215                 STATION 2  [temp, dewt, rh, wnd_spd_kmh]  [[24.2, 25.7, 82.1, 21.0], [...], [...]]
```
Answer 1
Score: 1
I don't think json_normalize alone can achieve the output you want. You can get there by adding explode (to expand the arrays generated by json_normalize) and pivot (to transform the dataframe from long to wide format).

The first step is to explode observation_data.data and reset the index (explode followed by reset_index) so that each group of records can be identified as one observation. From there we perform a second explode over both the columns and the data, as shown below.
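For illustration only (the original answer goes straight to the full chain), here is a sketch of the intermediate frame after just the first explode; step1 is a hypothetical name:

```python
>>> step1 = national_data.explode('observation_data.data').reset_index(drop=True)
>>> step1[['station_info.wmo_id', 'observation_data.columns', 'observation_data.data']]
   station_info.wmo_id       observation_data.columns     observation_data.data
0                95214  [temp, dewt, rh, wnd_spd_kmh]  [27.2, 23.7, 81.0, 26.0]
1                95214  [temp, dewt, rh, wnd_spd_kmh]  [27.3, 23.5, 80.0, 28.0]
2                95214  [temp, dewt, rh, wnd_spd_kmh]  [27.4, 23.2, 78.0, 30.0]
3                95215  [temp, dewt, rh, wnd_spd_kmh]  [24.2, 25.7, 82.1, 21.0]
4                95215  [temp, dewt, rh, wnd_spd_kmh]  [24.3, 25.8, 79.6, 22.0]
5                95215  [temp, dewt, rh, wnd_spd_kmh]  [24.4, 25.9, 78.3, 16.0]
```

Each row now represents one observation, and the row index (0 through 5) is what gets preserved as the anchor for the pivot.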
The last step is to pivot the resulting dataframe, using the index produced by the first explode as a unique anchor, to transform it from long to wide format.
```python
>>> data_exploded = (
...     national_data
...     .explode('observation_data.data')
...     .reset_index(drop=True)
...     .explode(['observation_data.columns', 'observation_data.data'])
...     .reset_index()
... )
>>> data_exploded
    index  station_info.wmo_id station_info.station_name observation_data.columns observation_data.data
0       0                95214                 STATION 1                      temp                  27.2
1       0                95214                 STATION 1                      dewt                  23.7
2       0                95214                 STATION 1                        rh                  81.0
3       0                95214                 STATION 1               wnd_spd_kmh                  26.0
4       1                95214                 STATION 1                      temp                  27.3
5       1                95214                 STATION 1                      dewt                  23.5
6       1                95214                 STATION 1                        rh                  80.0
7       1                95214                 STATION 1               wnd_spd_kmh                  28.0
8       2                95214                 STATION 1                      temp                  27.4
9       2                95214                 STATION 1                      dewt                  23.2
10      2                95214                 STATION 1                        rh                  78.0
11      2                95214                 STATION 1               wnd_spd_kmh                  30.0
12      3                95215                 STATION 2                      temp                  24.2
13      3                95215                 STATION 2                      dewt                  25.7
14      3                95215                 STATION 2                        rh                  82.1
15      3                95215                 STATION 2               wnd_spd_kmh                  21.0
16      4                95215                 STATION 2                      temp                  24.3
17      4                95215                 STATION 2                      dewt                  25.8
18      4                95215                 STATION 2                        rh                  79.6
19      4                95215                 STATION 2               wnd_spd_kmh                  22.0
20      5                95215                 STATION 2                      temp                  24.4
21      5                95215                 STATION 2                      dewt                  25.9
22      5                95215                 STATION 2                        rh                  78.3
23      5                95215                 STATION 2               wnd_spd_kmh                  16.0
```
```python
>>> (data_exploded
...  .pivot(index=['index', 'station_info.wmo_id', 'station_info.station_name'],
...         columns='observation_data.columns',
...         values='observation_data.data')
...  .reset_index()
...  .drop(columns=['index']))
observation_data.columns  station_info.wmo_id station_info.station_name  dewt    rh  temp  wnd_spd_kmh
0                                       95214                 STATION 1  23.7  81.0  27.2         26.0
1                                       95214                 STATION 1  23.5  80.0  27.3         28.0
2                                       95214                 STATION 1  23.2  78.0  27.4         30.0
3                                       95215                 STATION 2  25.7  82.1  24.2         21.0
4                                       95215                 STATION 2  25.8  79.6  24.3         22.0
5                                       95215                 STATION 2  25.9  78.3  24.4         16.0
```
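Note that the exploded values come out as object dtype, so you may want to finish with .infer_objects() or .astype(float) to restore numeric columns.

As an alternative not covered in the answer above, here is a minimal sketch assuming the same JSON structure: since each station record already pairs a "columns" list with a "data" list of rows, you can build one DataFrame per station and concatenate them, skipping json_normalize entirely:

```python
import pandas as pd

# j is the parsed JSON object from the question.
frames = []
for station in j['data']:
    info = station['station_info']
    obs = station['observation_data']
    # "data" is a list of rows and "columns" the matching header list,
    # exactly the shape the DataFrame constructor expects.
    frame = pd.DataFrame(obs['data'], columns=obs['columns'])
    frame.insert(0, 'station_info.wmo_id', info['wmo_id'])
    frame.insert(1, 'station_info.station_name', info['station_name'])
    frames.append(frame)

result = pd.concat(frames, ignore_index=True)
```

This yields the same wide table, with numeric dtypes preserved because the values never pass through an object-dtype explode.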