Get unique records from huge json

Question
I have an input JSON file that may contain 50K records, and I need to get unique key values from it.
Sample JSON:
[
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 123,
            "type": "insert",
            "job_name": "blogger-following",
            "target": "name2",
            "liked": 1,
            "watched": 7,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "type": "insert",
            "following_status": "followed",
            "session_id": 3456,
            "job_name": "blogger-following",
            "target": "name3",
            "liked": 67,
            "watched": 78,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 6789,
            "type": "insert",
            "job_name": "blogger-following",
            "target": "name4",
            "liked": 210,
            "watched": 77,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 123,
            "type": "update",
            "job_name": "blogger-following",
            "target": "name5",
            "liked": 21,
            "watched": 790,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "not followed",
            "session_id": 123456789,
            "type": "update",
            "job_name": "blogger-following",
            "target": "name6",
            "liked": 81,
            "watched": 7,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    },
    {
        "updated_data": {
            "last_interaction": "2023-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 123,
            "type": "update",
            "job_name": "blogger-following",
            "target": "name5",
            "liked": 21,
            "watched": 790,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    }
]
In the above input JSON, session_id is the key I need to check in order to keep only one record per unique session_id. For example, "session_id": 123 appears more than once. When two records share the same session_id, I need to check type: pick the record whose type is "update" and ignore the "insert" one.
If more than two records share the same session_id, I need to pick the one with type "update" and the latest last_interaction.
The output should be:
[
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "type": "insert",
            "following_status": "followed",
            "session_id": 3456,
            "job_name": "blogger-following",
            "target": "name3",
            "liked": 67,
            "watched": 78,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 6789,
            "type": "insert",
            "job_name": "blogger-following",
            "target": "name4",
            "liked": 210,
            "watched": 77,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 123,
            "type": "update",
            "job_name": "blogger-following",
            "target": "name5",
            "liked": 21,
            "watched": 790,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "not followed",
            "session_id": 123456789,
            "type": "update",
            "job_name": "blogger-following",
            "target": "name6",
            "liked": 81,
            "watched": 7,
            "commented": 0,
            "followed": false,
            "unfollowed": false,
            "scraped": false,
            "pm_sent": false
        }
    }
]
I tried the code below:
import json

with open("data.json", "r") as f_in:
    data = json.load(f_in)

values = []
uniqueNames = []
for i in data[0]['updated_data']:
    if i["session_id"] not in uniqueNames:
        uniqueNames.append(i["session_id"])
        values.append(i)
But it's not working as expected. Please share your expertise on how to achieve this efficiently.
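The likely reason this attempt fails: data is a list of records, so data[0]['updated_data'] iterates the string keys of the first record's inner dict rather than the records themselves, and i["session_id"] then raises a TypeError. A minimal sketch of the corrected loop (it still just keeps whichever record appears first, ignoring the type/last_interaction preference; seen_ids and the use of a set are my substitutions, not from the question):

import json

with open("data.json", "r") as f_in:
    data = json.load(f_in)

values = []
seen_ids = set()      # set membership is O(1); the original list lookup is O(n)
for record in data:   # iterate the list of records, not data[0]['updated_data']
    sid = record["updated_data"]["session_id"]
    if sid not in seen_ids:
        seen_ids.add(sid)
        values.append(record)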
Answer 1
Score: 1
I hope I've understood your question right. This example loads the data from the JSON file and sorts it by session_id, last_interaction, and whether type == "update", in descending order. It then groups the sorted data by session_id and takes the first element of each group:
import json
from itertools import groupby

with open("your_data.json", "r") as f_in:
    data = json.load(f_in)

out = []
# Sort so that, within each session_id, the record with the latest
# last_interaction (and then type == "update") comes first; then group
# by session_id and keep only the first record of each group.
for _, g in groupby(
    sorted(
        data,
        key=lambda d: (
            d["updated_data"]["session_id"],
            d["updated_data"]["last_interaction"],
            d["updated_data"]["type"] == "update",
        ),
        reverse=True,
    ),
    lambda d: d["updated_data"]["session_id"],
):
    out.append(next(g))

print(out)
Prints:
[
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "not followed",
            "session_id": 123456789,
            "type": "update",
            "job_name": "blogger-following",
            "target": "name6",
            "liked": 81,
            "watched": 7,
            "commented": 0,
            "followed": False,
            "unfollowed": False,
            "scraped": False,
            "pm_sent": False,
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 6789,
            "type": "insert",
            "job_name": "blogger-following",
            "target": "name4",
            "liked": 210,
            "watched": 77,
            "commented": 0,
            "followed": False,
            "unfollowed": False,
            "scraped": False,
            "pm_sent": False,
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "type": "insert",
            "following_status": "followed",
            "session_id": 3456,
            "job_name": "blogger-following",
            "target": "name3",
            "liked": 67,
            "watched": 78,
            "commented": 0,
            "followed": False,
            "unfollowed": False,
            "scraped": False,
            "pm_sent": False,
        }
    },
    {
        "updated_data": {
            "last_interaction": "2023-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 123,
            "type": "update",
            "job_name": "blogger-following",
            "target": "name5",
            "liked": 21,
            "watched": 790,
            "commented": 0,
            "followed": False,
            "unfollowed": False,
            "scraped": False,
            "pm_sent": False,
        }
    },
]
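As a design note, sort-then-group costs O(n log n); for ~50K records that is fine, but the same rules can also be applied in a single O(n) pass with a dict keyed by session_id. A minimal sketch under the question's stated precedence (prefer type == "update", then the latest last_interaction; the better helper is my own name, not part of the answer above):

import json

def better(a, b):
    # Prefer type == "update"; break ties with the latest last_interaction.
    # The fixed-width timestamp strings compare correctly as plain strings.
    def key(r):
        return (r["updated_data"]["type"] == "update",
                r["updated_data"]["last_interaction"])
    return key(a) > key(b)

with open("data.json", "r") as f_in:
    data = json.load(f_in)

best = {}  # session_id -> chosen record
for record in data:
    sid = record["updated_data"]["session_id"]
    if sid not in best or better(record, best[sid]):
        best[sid] = record

print(json.dumps(list(best.values()), indent=2))

Note that the precedence differs slightly from the answer's sort key, which compares last_interaction before the update flag; both happen to pick the same records for this sample.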