How can I improve performance of nested comprehension?



I am trying to use a Python 3.x comprehension to create a nested dictionary structure. My comprehension syntax works, but it is very slow, especially with a large data set. I have also created my desired data structure using loops, and it runs much faster, but I would like to know if there is a way to improve this comprehension to make it more efficient and potentially run as fast as, or faster than, my loop code.

My input data is a list of dictionaries, each dictionary outlining the specifics of an amateur radio contact (log entry). Here is a random subset of my data (limited to 20 entries, with non-essential keys removed to make this clearer):

    [{'BAND': '20M',
      'CALL': 'AA9GL',
      'COUNTRY': 'UNITED STATES OF AMERICA',
      'QSO_DATE': '20170528',
      'TIME_ON': '132100'},
     {'BAND': '20M',
      'CALL': 'KE4BFI',
      'COUNTRY': 'UNITED STATES OF AMERICA',
      'QSO_DATE': '20150704',
      'TIME_ON': '034600'},
     {'BAND': '20M',
      'CALL': 'W8OTR',
      'COUNTRY': 'UNITED STATES OF AMERICA',
      'QSO_DATE': '20190119',
      'TIME_ON': '194645'},
     {'BAND': '10M',
      'CALL': 'FY5FY',
      'COUNTRY': 'FRENCH GUIANA',
      'QSO_DATE': '20150328',
      'TIME_ON': '161953'},
     {'BAND': '17M',
      'CALL': 'KD5FOY',
      'COUNTRY': 'UNITED STATES OF AMERICA',
      'QSO_DATE': '20190121',
      'TIME_ON': '145630'},
     {'BAND': '10M',
      'CALL': 'K5GQ',
      'COUNTRY': 'UNITED STATES OF AMERICA',
      'QSO_DATE': '20150110',
      'TIME_ON': '195326'},
     {'BAND': '10M',
      'CALL': 'CR5L',
      'COUNTRY': 'PORTUGAL',
      'QSO_DATE': '20151025',
      'TIME_ON': '182351'},
     {'BAND': '20M',
      'CALL': 'AD4TR',
      'COUNTRY': 'UNITED STATES OF AMERICA',
      'QSO_DATE': '20170325',
      'TIME_ON': '144606'},
     {'BAND': '40M',
      'CALL': 'EA8FJ',
      'COUNTRY': 'CANARY ISLANDS',
      'QSO_DATE': '20170618',
      'TIME_ON': '020300'},
     {'BAND': '10M',
      'CALL': 'PY2DPM',
      'COUNTRY': 'BRAZIL',
      'QSO_DATE': '20150104',
      'TIME_ON': '205900'},
     {'BAND': '17M',
      'CALL': 'MM0HVU',
      'COUNTRY': 'SCOTLAND',
      'QSO_DATE': '20170416',
      'TIME_ON': '130200'},
     {'BAND': '10M',
      'CALL': 'LW3DG',
      'COUNTRY': 'ARGENTINA',
      'QSO_DATE': '20161029',
      'TIME_ON': '210629'},
     {'BAND': '10M',
      'CALL': 'LW3DG',
      'COUNTRY': 'ARGENTINA',
      'QSO_DATE': '20151025',
      'TIME_ON': '210714'},
     {'BAND': '20M',
      'CALL': 'EI7HDB',
      'COUNTRY': 'IRELAND',
      'QSO_DATE': '20170423',
      'TIME_ON': '184000'},
     {'BAND': '20M',
      'CALL': 'KM0NAS',
      'COUNTRY': 'UNITED STATES OF AMERICA',
      'QSO_DATE': '20180102',
      'TIME_ON': '142151'},
     {'BAND': '10M',
      'CALL': 'PY2TKB',
      'COUNTRY': 'BRAZIL',
      'QSO_DATE': '20150328',
      'TIME_ON': '223535'},
     {'BAND': '40M',
      'CALL': 'EB1DJ',
      'COUNTRY': 'SPAIN',
      'QSO_DATE': '20170326',
      'TIME_ON': '232430'},
     {'BAND': '40M',
      'CALL': 'LU6PCK',
      'COUNTRY': 'ARGENTINA',
      'QSO_DATE': '20150615',
      'TIME_ON': '000200'},
     {'BAND': '17M',
      'CALL': 'G3RKF',
      'COUNTRY': 'ENGLAND',
      'QSO_DATE': '20190121',
      'TIME_ON': '144315'},
     {'BAND': '20M',
      'CALL': 'UA1ZKI',
      'COUNTRY': 'EUROPEAN RUSSIA',
      'QSO_DATE': '20170508',
      'TIME_ON': '141400'}]

I want to create a dictionary where each key is a band (10M, 20M, etc.) and the value is a dictionary listing the countries contacted on that band as keys and a count of contacts for each country on that band as the values. Here is what my output looks like:

    {'10M': {'ARGENTINA': 2,
             'BRAZIL': 2,
             'FRENCH GUIANA': 1,
             'PORTUGAL': 1,
             'UNITED STATES OF AMERICA': 1},
     '17M': {'ENGLAND': 1, 'SCOTLAND': 1, 'UNITED STATES OF AMERICA': 1},
     '20M': {'EUROPEAN RUSSIA': 1, 'IRELAND': 1, 'UNITED STATES OF AMERICA': 5},
     '40M': {'ARGENTINA': 1, 'CANARY ISLANDS': 1, 'SPAIN': 1}}

This is the comprehension that I came up with to create the output. It works, and with the limited data set shown here, it runs quickly, but with an input list of a couple thousand entries, it takes quite a long time to run.

    worked_dxcc_by_band = {
        z["BAND"]: {
            x["COUNTRY"]: len([y["COUNTRY"]
                               for y in log_entries
                               if y["COUNTRY"] == x["COUNTRY"] and y["BAND"] == z["BAND"]])
            for x in log_entries
            if x["BAND"] == z["BAND"]
        }
        for z in log_entries
    }

Because this is a triple-nested comprehension, and all 3 loops run through the entire log_entries list, I am assuming that is why it gets very slow.

Is there a more efficient way to accomplish this with comprehension? I am fine using my loop to process the data but I am trying to enhance my skills regarding comprehensions so I thought this would be a good exercise!
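One way to keep the counting logic in a single pass while avoiding the repeated scans is to count `(band, country)` pairs with `collections.Counter`, then reshape the flat counts into the nested dictionary. This is a minimal sketch of that idea (the three-entry `log_entries` list here is made up for illustration), not the questioner's code:

```python
from collections import Counter

log_entries = [
    {"BAND": "20M", "COUNTRY": "IRELAND"},
    {"BAND": "10M", "COUNTRY": "BRAZIL"},
    {"BAND": "10M", "COUNTRY": "BRAZIL"},
]

# One pass over the data: count each (band, country) pair.
pair_counts = Counter((e["BAND"], e["COUNTRY"]) for e in log_entries)

# Reshape the flat counts into the nested {band: {country: count}} form.
worked_dxcc_by_band = {}
for (band, country), n in pair_counts.items():
    worked_dxcc_by_band.setdefault(band, {})[country] = n

print(worked_dxcc_by_band)
# {'20M': {'IRELAND': 1}, '10M': {'BRAZIL': 2}}
```

Because the counting happens once up front, the reshaping step only touches each distinct pair, instead of re-scanning the whole list for every entry.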

This is what I am doing without using comprehension: I have a function analyze_log_entry which I call as I load each log entry from a file.

    from collections import Counter

    worked_dxcc_by_band = {}

    def analyze_log_entry(entry):
        if "BAND" in entry:
            if "COUNTRY" in entry:
                if entry["BAND"] in worked_dxcc_by_band:
                    worked_dxcc_by_band[entry["BAND"]][entry["COUNTRY"]] += 1
                else:
                    worked_dxcc_by_band[entry["BAND"]] = Counter()
                    worked_dxcc_by_band[entry["BAND"]][entry["COUNTRY"]] = 1

This in itself may not be that efficient, but my full code has many similar blocks within the analyze_log_entry function that build multiple dictionaries. Because I go through all of my data only once, building the dictionaries where appropriate, it is probably much more efficient than using comprehensions, which are essentially multiple loops. As I said, this is more of an exercise to learn how to accomplish the same task with different methods.
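The nested if/else bookkeeping above can also be sketched with `collections.defaultdict`, which creates the inner `Counter` on first access. This is an alternative formulation of the same single-pass idea, not the questioner's actual code:

```python
from collections import Counter, defaultdict

# defaultdict creates an empty Counter the first time a band is seen,
# so no explicit membership test is needed.
worked_dxcc_by_band = defaultdict(Counter)

def analyze_log_entry(entry):
    if "BAND" in entry and "COUNTRY" in entry:
        worked_dxcc_by_band[entry["BAND"]][entry["COUNTRY"]] += 1

analyze_log_entry({"BAND": "20M", "COUNTRY": "IRELAND"})
analyze_log_entry({"BAND": "20M", "COUNTRY": "IRELAND"})
analyze_log_entry({"CALL": "AA9GL"})  # skipped: no BAND/COUNTRY keys

print(dict(worked_dxcc_by_band))
# {'20M': Counter({'IRELAND': 2})}
```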

Answer 1

Score: 3



EDIT: Dictionary comprehension version:

    out = {band: dict(Counter(v['COUNTRY'] for v in g))
           for band, g in groupby(sorted(data, key=lambda k: k['BAND']),
                                  lambda k: k['BAND'])}

You can combine itertools.groupby and collections.Counter:

    from itertools import groupby
    from collections import Counter

    s = sorted(data, key=lambda k: k['BAND'])

    out = {}
    for band, g in groupby(s, lambda k: k['BAND']):
        c = Counter(v['COUNTRY'] for v in g)
        out[band] = dict(c)

    from pprint import pprint
    pprint(out)

Prints:

    {'10M': {'ARGENTINA': 2,
             'BRAZIL': 2,
             'FRENCH GUIANA': 1,
             'PORTUGAL': 1,
             'UNITED STATES OF AMERICA': 1},
     '17M': {'ENGLAND': 1, 'SCOTLAND': 1, 'UNITED STATES OF AMERICA': 1},
     '20M': {'EUROPEAN RUSSIA': 1, 'IRELAND': 1, 'UNITED STATES OF AMERICA': 5},
     '40M': {'ARGENTINA': 1, 'CANARY ISLANDS': 1, 'SPAIN': 1}}
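Note that `itertools.groupby` only groups *consecutive* items with equal keys, which is why the data is sorted by band first; without the sort, the same band can show up in several groups and later groups overwrite earlier ones in the output dict. A small sketch (with made-up band values) illustrating the pitfall:

```python
from itertools import groupby

data = ["20M", "10M", "20M"]

# Without sorting, groupby starts a new group each time the key changes:
unsorted_groups = [(k, len(list(g))) for k, g in groupby(data)]
print(unsorted_groups)  # [('20M', 1), ('10M', 1), ('20M', 1)]

# After sorting, equal keys are adjacent and form a single group:
sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(data))]
print(sorted_groups)  # [('10M', 1), ('20M', 2)]
```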

EDIT: Without modules:

    out = {}
    for i in data:
        out.setdefault(i['BAND'], {}).setdefault(i['COUNTRY'], 0)
        out[i['BAND']][i['COUNTRY']] += 1

    from pprint import pprint
    pprint(out)

Benchmark:

    from timeit import timeit
    from itertools import groupby
    from collections import Counter

    def sol_orig():
        worked_dxcc_by_band = {
            z["BAND"]: {
                x["COUNTRY"]: len([y["COUNTRY"] for y in data
                                   if y["COUNTRY"] == x["COUNTRY"] and y["BAND"] == z["BAND"]])
                for x in data if x["BAND"] == z["BAND"]
            }
            for z in data
        }
        return worked_dxcc_by_band

    def solution():
        out = {band: dict(Counter(v['COUNTRY'] for v in g))
               for band, g in groupby(sorted(data, key=lambda k: k['BAND']),
                                      lambda k: k['BAND'])}
        return out

    def solution_2():
        out = {}
        for i in data:
            out.setdefault(i['BAND'], {}).setdefault(i['COUNTRY'], 0)
            out[i['BAND']][i['COUNTRY']] += 1
        return out

    t1 = timeit(lambda: solution(), number=10000)
    t2 = timeit(lambda: solution_2(), number=10000)
    t3 = timeit(lambda: sol_orig(), number=10000)
    print(t1)
    print(t2)
    print(t3)

Prints:

    0.18113317096140236
    0.08159565401729196
    3.5367472909856588

huangapple
  • Posted on 2020-01-07 00:41:36
  • Please keep this link when reposting: https://go.coder-hub.com/59615835.html