Get unique records from huge JSON


Question


I have an input JSON file that may contain around 50K records, and I need to extract the unique records from it.

Sample JSON :

    [
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 123,
          "type": "insert",
          "job_name": "blogger-following",
          "target": "name2",
          "liked": 1,
          "watched": 7,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "type": "insert",
          "following_status": "followed",
          "session_id": 3456,
          "job_name": "blogger-following",
          "target": "name3",
          "liked": 67,
          "watched": 78,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 6789,
          "type": "insert",
          "job_name": "blogger-following",
          "target": "name4",
          "liked": 210,
          "watched": 77,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 123,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name5",
          "liked": 21,
          "watched": 790,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "not followed",
          "session_id": 123456789,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name6",
          "liked": 81,
          "watched": 7,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2023-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 123,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name5",
          "liked": 21,
          "watched": 790,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      }
    ]

In the input JSON above, session_id is the key I need to deduplicate on, so that I get one record per unique session_id. For example, "session_id": 123 appears more than once. When two records share the same session_id, I need to check the "type" field, keep the "update" record, and ignore the "insert" one.

If more than two records share the same session_id, I need to keep the one whose "type" is "update" and whose "last_interaction" is the latest.
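In other words, the rule can be expressed as a ranking key per record (a sketch of the intent; `rec` stands for one element of the list above, and the field names are those from the sample):

```python
def rank(rec):
    d = rec["updated_data"]
    # Higher rank wins: "update" beats "insert"; ties break on the
    # latest last_interaction (these timestamp strings sort lexically).
    return (d["type"] == "update", d["last_interaction"])
```

The record with the highest rank within each session_id group is the one to keep.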

The output should be:

    [
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "type": "insert",
          "following_status": "followed",
          "session_id": 3456,
          "job_name": "blogger-following",
          "target": "name3",
          "liked": 67,
          "watched": 78,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 6789,
          "type": "insert",
          "job_name": "blogger-following",
          "target": "name4",
          "liked": 210,
          "watched": 77,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "followed",
          "session_id": 123,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name5",
          "liked": 21,
          "watched": 790,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      },
      {
        "updated_data": {
          "last_interaction": "2022-06-20 06:55:55.652434",
          "following_status": "not followed",
          "session_id": 123456789,
          "type": "update",
          "job_name": "blogger-following",
          "target": "name6",
          "liked": 81,
          "watched": 7,
          "commented": 0,
          "followed": false,
          "unfollowed": false,
          "scraped": false,
          "pm_sent": false
        }
      }
    ]

I tried the code below:

    import json

    with open("data.json", "r") as f_in:
        data = json.load(f_in)

    values = []
    uniqueNames = []
    for i in data[0]['updated_data']:
        if i["session_id"] not in uniqueNames:
            uniqueNames.append(i["session_id"])
            values.append(i)

But it is not working as expected. Please share your expertise on how to achieve this efficiently.
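For reference, the immediate bug in this attempt is that `data[0]['updated_data']` is a single dict, so the loop iterates over its string keys rather than over the list of records. A minimal sketch of the corrected loop (first record seen per session_id wins; made-up inline data stands in for data.json, and the "update"/latest-interaction rules are not yet applied):

```python
def first_per_session(data):
    """Keep the first record seen for each session_id."""
    values = []
    seen_ids = []
    for record in data:  # iterate over the list of records, not data[0]["updated_data"]
        session_id = record["updated_data"]["session_id"]
        if session_id not in seen_ids:
            seen_ids.append(session_id)
            values.append(record)
    return values

# Made-up miniature input in the same shape as the sample JSON above.
data = [
    {"updated_data": {"session_id": 123, "type": "insert"}},
    {"updated_data": {"session_id": 3456, "type": "insert"}},
    {"updated_data": {"session_id": 123, "type": "update"}},
]
print(first_per_session(data))  # keeps the first 123 record and the 3456 record
```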

Answer 1

Score: 1


I hope I've understood your question right. This example will load the data from the JSON file and sort it by session_id (with entries that have type == "update" first). It then groups the data by session_id and takes the first element from each group:

    import json
    from itertools import groupby

    with open("your_data.json", "r") as f_in:
        data = json.load(f_in)

    out = []
    for _, g in groupby(
        sorted(
            data,
            key=lambda d: (
                d["updated_data"]["session_id"],
                d["updated_data"]["last_interaction"],
                d["updated_data"]["type"] == "update",
            ),
            reverse=True,
        ),
        lambda d: d["updated_data"]["session_id"],
    ):
        out.append(next(g))

    print(out)

Prints:

    [
        {
            "updated_data": {
                "last_interaction": "2022-06-20 06:55:55.652434",
                "following_status": "not followed",
                "session_id": 123456789,
                "type": "update",
                "job_name": "blogger-following",
                "target": "name6",
                "liked": 81,
                "watched": 7,
                "commented": 0,
                "followed": False,
                "unfollowed": False,
                "scraped": False,
                "pm_sent": False,
            }
        },
        {
            "updated_data": {
                "last_interaction": "2022-06-20 06:55:55.652434",
                "following_status": "followed",
                "session_id": 6789,
                "type": "insert",
                "job_name": "blogger-following",
                "target": "name4",
                "liked": 210,
                "watched": 77,
                "commented": 0,
                "followed": False,
                "unfollowed": False,
                "scraped": False,
                "pm_sent": False,
            }
        },
        {
            "updated_data": {
                "last_interaction": "2022-06-20 06:55:55.652434",
                "type": "insert",
                "following_status": "followed",
                "session_id": 3456,
                "job_name": "blogger-following",
                "target": "name3",
                "liked": 67,
                "watched": 78,
                "commented": 0,
                "followed": False,
                "unfollowed": False,
                "scraped": False,
                "pm_sent": False,
            }
        },
        {
            "updated_data": {
                "last_interaction": "2023-06-20 06:55:55.652434",
                "following_status": "followed",
                "session_id": 123,
                "type": "update",
                "job_name": "blogger-following",
                "target": "name5",
                "liked": 21,
                "watched": 790,
                "commented": 0,
                "followed": False,
                "unfollowed": False,
                "scraped": False,
                "pm_sent": False,
            }
        },
    ]
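For ~50K records, an alternative to sort-then-group is a single pass that keeps the best record per session_id in a dict (a sketch, assuming the same record structure as above; note that it ranks type == "update" ahead of last_interaction, matching the question's wording, whereas the sort key above compares last_interaction first):

```python
def dedupe(records):
    """One record per session_id: prefer type == "update", then the
    latest last_interaction (these timestamp strings sort lexically)."""
    best = {}  # session_id -> (rank, record)
    for rec in records:
        d = rec["updated_data"]
        rank = (d["type"] == "update", d["last_interaction"])
        if d["session_id"] not in best or rank > best[d["session_id"]][0]:
            best[d["session_id"]] = (rank, rec)
    return [rec for _, rec in best.values()]

# Made-up miniature input: three records sharing session_id 123.
records = [
    {"updated_data": {"session_id": 123, "type": "insert",
                      "last_interaction": "2022-06-20 06:55:55"}},
    {"updated_data": {"session_id": 123, "type": "update",
                      "last_interaction": "2022-06-20 06:55:55"}},
    {"updated_data": {"session_id": 123, "type": "update",
                      "last_interaction": "2023-06-20 06:55:55"}},
]
print(dedupe(records))  # the 2023 "update" record wins
```

Since Python 3.7 dicts preserve insertion order, so the result follows the first appearance of each session_id rather than reverse-sorted order.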

huangapple
  • Posted on 2023-02-23 22:49:14
  • When reposting, please keep the original link: https://go.coder-hub.com/75546421.html