英文:
Partially flatten nested JSON and pivot longer
问题
我有许多具有以下结构的JSON文件:
{
"requestId": "test",
"executionDate": "2023-05-10",
"executionTime": "12:02:22",
"request": {
"fields": [{
"geometry": {
"type": "Point",
"coordinates": [-90, 41]
},
"colour": "blue",
"bean": "blaCk",
"birthday": "2021-01-01",
"arst": "111",
"arstg": "rst",
"fct": {
"start": "2011-01-10",
"end": "2012-01-10"
}
}]
},
"response": {
"results": [{
"geom": {
"type": "geo",
"coord": [-90, 41]
},
"md": {
"type": "arstat",
"mdl": "trstr",
"vs": "v0",
"cal": {
"num": 4,
"comment": "message"
},
"bean": ["blue", "green"],
"result_time": 12342
},
"predictions": [{
"date": "2004-05-19",
"day": 0,
"count": 0,
"eating_stage": "trt"
}, {
"date": "2002-01-20",
"day": 1,
"count": 0,
"eating_stage": "arstg"
}, {
"date": "2004-05-21",
"day": 2,
"count": 0,
"eating_stage": "strg"
}, {
"date": "2004-05-22",
"day": 3,
"count": 0,
"eating_stage": "rst"
}]
}]
}
}
预测部分可能会非常深。我想将此JSON转换为以下结构的CSV:
requestId|executionDate|executionTime|colour|predictions_date|predictions_day|predictions_count|predictions_eating_stage
---|---|---|---|---|---|---|---
test|2023-05-10|12:02:22|blue|2004-05-19|0|0|trt
test|2023-05-10|12:02:22|blue|2002-01-20|1|0|arstg
test|2023-05-10|12:02:22|blue|2004-05-21|2|0|strg
test|2023-05-10|12:02:22|blue|2004-05-22|3|0|rst
我尝试了以下代码:
flat_json = pd.DataFrame(
flatten(json_data), index=[0]
)
该代码导致每个数据点都变成了一列,我不确定如何在Python中使用JSON函数在'predictions'键处更长时间地进行旋转。我意识到在这个阶段,我可以使用列名进行更长时间地旋转,但我觉得有一种更简洁的方法来实现这一点。
英文:
I have many JSON files with the following structure:
<!-- begin snippet: js hide: false console: true babel: false -->
<!-- language: lang-js -->
{
"requestId": "test",
"executionDate": "2023-05-10",
"executionTime": "12:02:22",
"request": {
"fields": [{
"geometry": {
"type": "Point",
"coordinates": [-90, 41]
},
"colour": "blue",
"bean": "blaCk",
"birthday": "2021-01-01",
"arst": "111",
"arstg": "rst",
"fct": {
"start": "2011-01-10",
"end": "2012-01-10"
}
}]
},
"response": {
"results": [{
"geom": {
"type": "geo",
"coord": [-90, 41]
},
"md": {
"type": "arstat",
"mdl": "trstr",
"vs": "v0",
"cal": {
"num": 4,
"comment": "message"
},
"bean": ["blue", "green"],
"result_time": 12342
},
"predictions": [{
"date": "2004-05-19",
"day": 0,
"count": 0,
"eating_stage": "trt"
}, {
"date": "2002-01-20",
"day": 1,
"count": 0,
"eating_stage": "arstg"
}, {
"date": "2004-05-21",
"day": 2,
"count": 0,
"eating_stage": "strg"
}, {
"date": "2004-05-22",
"day": 3,
"count": 0,
"eating_stage": "rst"
}
}
}
}
<!-- end snippet -->
The predictions part can be very deep. I want to convert this JSON to a CSV with the following structure:
requestId | executionDate | executionTime | colour | predictions_date | predictions_day | predictions_count | predictions_eating_stage |
---|---|---|---|---|---|---|---|
test | 2023-05-10 | 12:02:22 | blue | 2004-05-19 | 0 | 0 | trt |
test | 2023-05-10 | 12:02:22 | blue | 2002-01-20 | 1 | 0 | astrg |
test | 2023-05-10 | 12:02:22 | blue | 2004-05-21 | 2 | 0 | strg |
test | 2023-05-10 | 12:02:22 | blue | 2004-05-22 | 3 | 0 | rst |
I tried the following code:
flat_json = pd.DataFrame(
flatten(json_data), index=[0]
)
The code results in every data point becoming a column, and I am not sure how to pivot longer where at the 'predictions' key using JSON functions in Python. I recognise that at this stage I could pivot longer using column names, but I feel like there is a cleaner way to achieve this.
答案1
得分: 1
我建议简单地提取您所需的内容。这似乎非常具体,不需要使用特定的解析方法来解决。因此,我建议首先创建两个数据框:
df_prediction = pd.DataFrame(example['response']['results'][0]['predictions'])
df_data = pd.DataFrame({x: y for x, y in example.items() if type(y) == str}, index=[0])
重命名预测列:
df_prediction.columns = ['prediction_' + x for x in df_prediction]
连接并添加最后的数据部分(颜色):
output = df_data.assign(colour=example['request']['fields'][0]['colour']).join(df_prediction, how='right').ffill()
输出结果:
requestId executionDate ... prediction_count prediction_eating_stage
0 test 2023-05-10 ... 0 trt
1 test 2023-05-10 ... 0 arstg
2 test 2023-05-10 ... 0 strg
3 test 2023-05-10 ... 0 rst
请注意,这些代码中使用了Python的pandas库,用于数据处理和操作。如果您需要进一步的帮助或解释,请随时告诉我。
英文:
I would suggest simply extracting what you need. It seems very specific for it to be solved using specific parsing. Therefore I would start by creating two dataframes:
df_prediction = pd.DataFrame(example['response']['results'][0]['predictions'])
df_data = pd.DataFrame({x:y for x,y in example.items() if type(y)==str},index=[0])
Renaming columns in predictions:
df_prediction.columns = ['prediction_'+x for x in df_prediction]
Joining and adding the last piece of data (colour):
output = df_data.assign(colour = example['request']['fields'][0]['colour']).join(df_prediction,how='right').ffill()
Outputting:
requestId executionDate ... prediction_count prediction_eating_stage
0 test 2023-05-10 ... 0 trt
1 test 2023-05-10 ... 0 arstg
2 test 2023-05-10 ... 0 strg
3 test 2023-05-10 ... 0 rst
答案2
得分: 1
你也可以使用json_normalize来提取你想要规范化为csv的记录数组。
df_predictions = pd.json_normalize(json_data, record_path=['response', 'results', 'predictions'], record_prefix='predictions.', meta=['requestId', 'executionDate', 'executionTime']).assign(colour=json_data['request']['fields'][0]['colour'])
df_predictions
不幸的是,meta字段存在限制,因为它对包含数组/列表的路径抛出异常,所以“colour”列是单独添加的。如果顺序很重要,那么你可以根据需要重新排列列。
英文:
You can also use json_normalize to extract the array of records that you want to normalize into a csv.
>>> df_predictions = pd.json_normalize(json_data,record_path=['response', 'results','predictions'], record_prefix='predictions.', meta=['requestId', 'executionDate', 'executionTime']).assign(colour = json_data['request']['fields'][0]['colour'])
>>> df_predictions
predictions.date predictions.day predictions.count ... executionDate executionTime colour
0 2004-05-19 0 0 ... 2023-05-10 12:02:22 blue
1 2002-01-20 1 0 ... 2023-05-10 12:02:22 blue
2 2004-05-21 2 0 ... 2023-05-10 12:02:22 blue
3 2004-05-22 3 0 ... 2023-05-10 12:02:22 blue
[4 rows x 8 columns]
It is just unfortunate that there is a limitation on the meta fields as it is throwing an exception for a path that includes an array / list so the "colour" column was added separately. If the order is important, then you can rearrange the columns as needed.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论