Get unique records from huge JSON


Question


I have an input JSON file that may contain around 50K records, and I need to extract the unique records from it.

Sample JSON :

    [
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 123,
          "type": "insert",
          "job_name": "blogger-following",
          "target": "name2",
          "liked": 1,
          "watched": 7,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "type": "insert",
          "following_status": "followed",
          "session_id": 3456,
          "job_name": "blogger-following",
          "target": "name3",
          "liked": 67,
          "watched": 78,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 6789,
          "type": "insert",
          "job_name": "blogger-following",
          "target": "name4",
          "liked": 210,
          "watched": 77,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 123,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name5",
          "liked": 21,
          "watched": 790,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "not followed",
          "session_id": 123456789,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name6",
          "liked": 81,
          "watched": 7,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2023-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 123,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name5",
          "liked": 21,
          "watched": 790,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      }
    ]

In the input JSON above, session_id is the key I need to deduplicate on, so that I get one record per unique session_id. For example, "session_id": 123 appears more than once. When two records share the same session_id, I need to check the "type" field, keep the "update" record, and ignore the "insert" one.

If more than two records share the same session_id, I need to keep the one whose "type" is "update" and whose "last_interaction" is the latest.
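In other words, the rule can be expressed as a ranking key per record (a sketch of the intent; `rec` stands for one element of the list above, and the field names are those from the sample):

```python
def rank(rec):
    d = rec["updated_data"]
    # Higher rank wins: "update" beats "insert"; ties break on the
    # latest last_interaction (these timestamp strings sort lexically).
    return (d["type"] == "update", d["last_interaction"])
```

The record with the highest rank within each session_id group is the one to keep.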

The output should be:

    [
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "type": "insert",
          "following_status": "followed",
          "session_id": 3456,
          "job_name": "blogger-following",
          "target": "name3",
          "liked": 67,
          "watched": 78,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 6789,
          "type": "insert",
          "job_name": "blogger-following",
          "target": "name4",
          "liked": 210,
          "watched": 77,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 123,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name5",
          "liked": 21,
          "watched": 790,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "not followed",
          "session_id": 123456789,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name6",
          "liked": 81,
          "watched": 7,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      }
    ]

I tried the code below:

    import json

    with open("data.json", "r") as f_in:
        data = json.load(f_in)

    values = []
    uniqueNames = []
    for i in data[0]['updated_data']:
        if i["session_id"] not in uniqueNames:
            uniqueNames.append(i["session_id"])
            values.append(i)

But it is not working as expected. Please share your expertise on how to achieve this efficiently.
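For reference, the immediate bug in this attempt is that `data[0]['updated_data']` is a single dict, so the loop iterates over its string keys rather than over the list of records. A minimal sketch of the corrected loop (first record seen per session_id wins; made-up inline data stands in for data.json, and the "update"/latest-interaction rules are not yet applied):

```python
def first_per_session(data):
    """Keep the first record seen for each session_id."""
    values = []
    seen_ids = []
    for record in data:  # iterate over the list of records, not data[0]["updated_data"]
        session_id = record["updated_data"]["session_id"]
        if session_id not in seen_ids:
            seen_ids.append(session_id)
            values.append(record)
    return values

# Made-up miniature input in the same shape as the sample JSON above.
data = [
    {"updated_data": {"session_id": 123, "type": "insert"}},
    {"updated_data": {"session_id": 3456, "type": "insert"}},
    {"updated_data": {"session_id": 123, "type": "update"}},
]
print(first_per_session(data))  # keeps the first 123 record and the 3456 record
```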

Answer 1

Score: 1


I hope I've understood your question right. This example will load the data from the JSON file and sort it by session_id (with entries that have type == "update" first). It then groups the data by session_id and takes the first element from each group:

    import json
    from itertools import groupby

    with open("your_data.json", "r") as f_in:
        data = json.load(f_in)

    out = []
    for _, g in groupby(
        sorted(
            data,
            key=lambda d: (
                d["updated_data"]["session_id"],
                d["updated_data"]["last_interaction"],
                d["updated_data"]["type"] == "update",
            ),
            reverse=True,
        ),
        lambda d: d["updated_data"]["session_id"],
    ):
        out.append(next(g))

    print(out)

Prints:

    [
        {
            "updated_data": {
                "last_interaction": "2022-06-20 06:55:55.652434",
                "following_status": "not followed",
                "session_id": 123456789,
                "type": "update",
                "job_name": "blogger-following",
                "target": "name6",
                "liked": 81,
                "watched": 7,
                "commented": 0,
                "followed": False,
                "unfollowed": False,
                "scraped": False,
                "pm_sent": False,
            }
        },
        {
            "updated_data": {
                "last_interaction": "2022-06-20 06:55:55.652434",
                "following_status": "followed",
                "session_id": 6789,
                "type": "insert",
                "job_name": "blogger-following",
                "target": "name4",
                "liked": 210,
                "watched": 77,
                "commented": 0,
                "followed": False,
                "unfollowed": False,
                "scraped": False,
                "pm_sent": False,
            }
        },
        {
            "updated_data": {
                "last_interaction": "2022-06-20 06:55:55.652434",
                "type": "insert",
                "following_status": "followed",
                "session_id": 3456,
                "job_name": "blogger-following",
                "target": "name3",
                "liked": 67,
                "watched": 78,
                "commented": 0,
                "followed": False,
                "unfollowed": False,
                "scraped": False,
                "pm_sent": False,
            }
        },
        {
            "updated_data": {
                "last_interaction": "2023-06-20 06:55:55.652434",
                "following_status": "followed",
                "session_id": 123,
                "type": "update",
                "job_name": "blogger-following",
                "target": "name5",
                "liked": 21,
                "watched": 790,
                "commented": 0,
                "followed": False,
                "unfollowed": False,
                "scraped": False,
                "pm_sent": False,
            }
        },
    ]
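For ~50K records, an alternative to sort-then-group is a single pass that keeps the best record per session_id in a dict (a sketch, assuming the same record structure as above; note that it ranks type == "update" ahead of last_interaction, matching the question's wording, whereas the sort key above compares last_interaction first):

```python
def dedupe(records):
    """One record per session_id: prefer type == "update", then the
    latest last_interaction (these timestamp strings sort lexically)."""
    best = {}  # session_id -> (rank, record)
    for rec in records:
        d = rec["updated_data"]
        rank = (d["type"] == "update", d["last_interaction"])
        if d["session_id"] not in best or rank > best[d["session_id"]][0]:
            best[d["session_id"]] = (rank, rec)
    return [rec for _, rec in best.values()]

# Made-up miniature input: three records sharing session_id 123.
records = [
    {"updated_data": {"session_id": 123, "type": "insert",
                      "last_interaction": "2022-06-20 06:55:55"}},
    {"updated_data": {"session_id": 123, "type": "update",
                      "last_interaction": "2022-06-20 06:55:55"}},
    {"updated_data": {"session_id": 123, "type": "update",
                      "last_interaction": "2023-06-20 06:55:55"}},
]
print(dedupe(records))  # the 2023 "update" record wins
```

Since Python 3.7 dicts preserve insertion order, so the result follows the first appearance of each session_id rather than reverse-sorted order.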

huangapple
  • Posted on 2023-02-23 22:49:14
  • When reposting, please keep the original link: https://go.coder-hub.com/75546421.html