Get unique records from huge json

Question

I have an input JSON file that may contain around 50K records, and I need to extract the records with unique key values from it.

Sample JSON:

  [
  {
    "updated_data": {
      "last_interaction": "2022-06-20 06:55:55.652434",
      "following_status": "followed",
      "session_id": 123,
      "type":"insert",
      "job_name": "blogger-following",
      "target": "name2",
      "liked": 1,
      "watched": 7,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  },
  {
    "updated_data": {
      "last_interaction": "2022-06-20 06:55:55.652434",
      "type":"insert",
      "following_status": "followed",
      "session_id": 3456,
      "job_name": "blogger-following",
      "target": "name3",
      "liked": 67,
      "watched": 78,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  },
  {
    "updated_data": {
      "last_interaction": "2022-06-20 06:55:55.652434",
      "following_status": "followed",
      "session_id": 6789,
      "type":"insert",
      "job_name": "blogger-following",
      "target": "name4",
      "liked": 210,
      "watched": 77,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  },
  {
    "updated_data": {
      "last_interaction": "2022-06-20 06:55:55.652434",
      "following_status": "followed",
      "session_id": 123,
      "type":"update",
      "job_name": "blogger-following",
      "target": "name5",
      "liked": 21,
      "watched": 790,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  },
  {
    "updated_data": {
      "last_interaction": "2022-06-20 06:55:55.652434",
      "following_status": "not followed",
      "session_id": 123456789,
      "type":"update",
      "job_name": "blogger-following",
      "target": "name6",
      "liked": 81,
      "watched": 7,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  },
  {
    "updated_data": {
      "last_interaction": "2023-06-20 06:55:55.652434",
      "following_status": "followed",
      "session_id": 123,
      "type":"update",
      "job_name": "blogger-following",
      "target": "name5",
      "liked": 21,
      "watched": 790,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  }
]

In the input JSON above, session_id is the key I need to check in order to get one record per unique session_id. For example, session_id 123 appears more than once. When two records share the same session_id, I need to keep only one of them based on "type": pick the record whose type is "update" and ignore the "insert" one.

If more than two records share the same session_id, I need to pick the one whose type is "update" and whose "last_interaction" is the latest.

The output should be:

[
  {
    "updated_data": {
      "last_interaction": "2022-06-20 06:55:55.652434",
      "type":"insert",
      "following_status": "followed",
      "session_id": 3456,
      "job_name": "blogger-following",
      "target": "name3",
      "liked": 67,
      "watched": 78,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  },
  {
    "updated_data": {
      "last_interaction": "2022-06-20 06:55:55.652434",
      "following_status": "followed",
      "session_id": 6789,
      "type":"insert",
      "job_name": "blogger-following",
      "target": "name4",
      "liked": 210,
      "watched": 77,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  },
  {
    "updated_data": {
      "last_interaction": "2022-06-20 06:55:55.652434",
      "following_status": "followed",
      "session_id": 123,
      "type":"update",
      "job_name": "blogger-following",
      "target": "name5",
      "liked": 21,
      "watched": 790,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  },
  {
    "updated_data": {
      "last_interaction": "2022-06-20 06:55:55.652434",
      "following_status": "not followed",
      "session_id": 123456789,
      "type":"update",
      "job_name": "blogger-following",
      "target": "name6",
      "liked": 81,
      "watched": 7,
      "commented": 0,
      "followed": false,
      "unfollowed": false,
      "scraped": false,
      "pm_sent": false
    }
  }
]

I tried the code below:

import json

with open("data.json", "r") as f_in:
    data = json.load(f_in)

values = [];
uniqueNames = [];
for i in data[0]['updated_data']:
    if(i["session_id"] not in uniqueNames):
         uniqueNames.append(i["session_id"]);
         values.append(i) 

But it is not working as expected. Please share your expertise on how to achieve this in an efficient way.
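For reference, the attempt above most likely fails because data[0]['updated_data'] is a single dict, so the loop iterates over its keys (strings), and indexing a string with "session_id" raises a TypeError. Below is a minimal sketch of the loop the code seems to intend, iterating over the records themselves. The in-memory `data` list is a hypothetical stand-in for the file contents, and `seen_ids` is a name introduced here for illustration. Note that this only keeps the first record seen per session_id; it does not yet apply the "update" / latest "last_interaction" rule.

```python
# Hypothetical in-memory sample standing in for the contents of data.json.
data = [
    {"updated_data": {"session_id": 123, "type": "insert"}},
    {"updated_data": {"session_id": 3456, "type": "insert"}},
    {"updated_data": {"session_id": 123, "type": "update"}},
]

values = []
seen_ids = set()  # set membership checks are O(1), unlike `in` on a list

for item in data:  # iterate over the records, not over data[0]["updated_data"]
    record = item["updated_data"]
    if record["session_id"] not in seen_ids:
        seen_ids.add(record["session_id"])
        values.append(item)

print([v["updated_data"]["session_id"] for v in values])
```

Using a set instead of a list for the seen ids matters at 50K records, since each membership test on a list scans the whole list.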

Answer 1

Score: 1

I hope I've understood your question correctly. This example loads the data from the JSON file and sorts it by session_id, placing entries with type == "update" and the latest last_interaction first within each session_id. It then groups the data by session_id and takes the first element from each group:

import json
from itertools import groupby


with open("your_data.json", "r") as f_in:
    data = json.load(f_in)

out = []
for _, g in groupby(
    sorted(
        data,
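        # Within each session_id, rank "update" entries above "insert" ones,
        # then newer last_interaction above older (reverse=True puts the best first).
        key=lambda d: (
            d["updated_data"]["session_id"],
            d["updated_data"]["type"] == "update",
            d["updated_data"]["last_interaction"],
        ),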
        reverse=True,
    ),
    lambda d: d["updated_data"]["session_id"],
):
    out.append(next(g))

print(out)

Prints:

[
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "not followed",
            "session_id": 123456789,
            "type": "update",
            "job_name": "blogger-following",
            "target": "name6",
            "liked": 81,
            "watched": 7,
            "commented": 0,
            "followed": False,
            "unfollowed": False,
            "scraped": False,
            "pm_sent": False,
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 6789,
            "type": "insert",
            "job_name": "blogger-following",
            "target": "name4",
            "liked": 210,
            "watched": 77,
            "commented": 0,
            "followed": False,
            "unfollowed": False,
            "scraped": False,
            "pm_sent": False,
        }
    },
    {
        "updated_data": {
            "last_interaction": "2022-06-20 06:55:55.652434",
            "type": "insert",
            "following_status": "followed",
            "session_id": 3456,
            "job_name": "blogger-following",
            "target": "name3",
            "liked": 67,
            "watched": 78,
            "commented": 0,
            "followed": False,
            "unfollowed": False,
            "scraped": False,
            "pm_sent": False,
        }
    },
    {
        "updated_data": {
            "last_interaction": "2023-06-20 06:55:55.652434",
            "following_status": "followed",
            "session_id": 123,
            "type": "update",
            "job_name": "blogger-following",
            "target": "name5",
            "liked": 21,
            "watched": 790,
            "commented": 0,
            "followed": False,
            "unfollowed": False,
            "scraped": False,
            "pm_sent": False,
        }
    },
]
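As an alternative to sorting and grouping, the same selection can be sketched as a single pass that keeps the best record per session_id in a dict. This is only a sketch under stated assumptions, not part of the original answer: `pick_unique` and `rank` are hypothetical names, the `sample` list is a trimmed stand-in for the real file, and the timestamps are assumed to share one fixed format so that plain string comparison orders them chronologically.

```python
def pick_unique(records):
    """Keep one record per session_id: prefer type == "update",
    then the latest last_interaction."""

    def rank(item):
        d = item["updated_data"]
        # Assumption: all timestamps use the same fixed-width format,
        # so comparing them as strings matches chronological order.
        return (d["type"] == "update", d["last_interaction"])

    best = {}  # session_id -> best record seen so far
    for item in records:
        sid = item["updated_data"]["session_id"]
        if sid not in best or rank(item) > rank(best[sid]):
            best[sid] = item
    return list(best.values())


# Trimmed, hypothetical sample with only the fields the selection needs.
sample = [
    {"updated_data": {"session_id": 123, "type": "insert",
                      "last_interaction": "2022-06-20 06:55:55.652434"}},
    {"updated_data": {"session_id": 3456, "type": "insert",
                      "last_interaction": "2022-06-20 06:55:55.652434"}},
    {"updated_data": {"session_id": 123, "type": "update",
                      "last_interaction": "2022-06-20 06:55:55.652434"}},
    {"updated_data": {"session_id": 123, "type": "update",
                      "last_interaction": "2023-06-20 06:55:55.652434"}},
]

result = pick_unique(sample)
for item in result:
    print(item["updated_data"])
```

This version avoids the sort, so it runs in time linear in the number of records, and the result keeps the order in which each session_id first appeared. If the timestamp format can vary, parsing with datetime.fromisoformat before comparing would be the safer choice.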

Posted by huangapple on 2023-02-23 22:49:14
Link: https://go.coder-hub.com/75546421.html