What is the fastest way to do "indexed" look-ups in Polars?


Question

I am working with large Polars dataframes that are fully loaded in memory. Each row is uniquely identified by the columns entityId (Int64) and entryDate (date).

I know Polars does not have indexes, but I still need to do ad-hoc look-ups of rows against these tables, and it happens frequently enough that it accounts for a non-trivial share of my application's runtime.

Currently I am finding these rows with .filter:

    def locate(df, entityId, entryDate) -> pl.DataFrame:
        return df.filter(pl.col('entityId') == entityId).filter(pl.col('entryDate') == entryDate)

This is fairly fast, but it still takes 50 to 100 ms to look up a single row.

Are there any optimizations I'm missing?

Some things I've tried:

  1. using .lazy / .collect (no change; see the sketch below)
  2. sorting by entityId (no change)
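For reference, the lazy variant from item 1 was along these lines (a rough sketch; `locate_lazy` is just an illustrative name):

    def locate_lazy(df: pl.DataFrame, entityId, entryDate) -> pl.DataFrame:
        # Same predicates as locate(), routed through the lazy engine
        return (
            df.lazy()
            .filter(pl.col('entityId') == entityId)
            .filter(pl.col('entryDate') == entryDate)
            .collect()
        )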

I'm on Polars 0.17.12.


Answer 1

Score: 2

I'm going to start with

    import numpy as np
    import polars as pl
    from datetime import datetime, timedelta

    size = 1000000
    df = pl.DataFrame({'entityId': np.random.randint(0, 200, size=size),
                       'entryDate': [datetime(2022, 1, 1).date() + timedelta(days=int(np.random.randint(1, 200, 1)[0])) for _ in range(size)],
                       'blah': np.random.random(size=size)})

    %%timeit
    df.filter(pl.col('entityId') == 34).filter(pl.col('entryDate') == datetime(2022, 2, 16).date())
    1.45 ms ± 14 µs per loop

That's the benchmark.
The first improvement is to combine the filters instead of chaining them...

    %%timeit
    df.filter((pl.col('entityId') == 34) & (pl.col('entryDate') == datetime(2022, 2, 16).date()))
    996 µs ± 16.6 µs per loop
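Applied to the `locate` helper from the question, that change is a one-liner (a sketch using the question's column names):

    def locate(df: pl.DataFrame, entityId, entryDate) -> pl.DataFrame:
        # One combined predicate instead of two chained filter() calls
        return df.filter(
            (pl.col('entityId') == entityId) & (pl.col('entryDate') == entryDate)
        )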
Now let's try sorting the columns...

    df = df.sort(['entityId', 'entryDate'])

After sorting...

    %%timeit
    df.filter(pl.col('entityId') == 34).filter(pl.col('entryDate') == datetime(2022, 2, 16).date())
    1.33 ms ± 23.7 µs per loop

That's barely an improvement over the initial 1.45 ms.

    %%timeit
    df.filter((pl.col('entityId') == 34) & (pl.col('entryDate') == datetime(2022, 2, 16).date()))
    1.04 ms ± 35.5 µs per loop

Not sure why that took longer on average, although the two overlap within one standard deviation, so they're essentially the same.
Now let's try making a struct column out of the two columns, which we'll call INDEX...

    df = df.select(pl.struct(['entityId', 'entryDate']).alias('INDEX'), pl.exclude(['entityId', 'entryDate'])).sort('INDEX')

    %%timeit
    df.filter((pl.col('INDEX').struct.field('entityId') == 34) & (pl.col('INDEX').struct.field('entryDate') == datetime(2022, 2, 16).date()))
    946 µs ± 18.4 µs per loop

That's a slight improvement.
Let's try partitioning the df by `entityId`...

    dfdicts = df.unnest('INDEX').partition_by('entityId', as_dict=True)

    %%timeit
    dfdicts[34].filter(pl.col('entryDate') == datetime(2022, 2, 16).date())
    445 µs ± 6.51 µs per loop
That's a big improvement! Now let's try partitioning the other way...

    dfdicts2 = df.unnest('INDEX').partition_by('entryDate', as_dict=True)

    %%timeit
    dfdicts2[datetime(2022, 2, 16).date()].filter(pl.col('entityId') == 34)
    382 µs ± 7.23 µs per loop
The `search_sorted` trick only works when you want a single row back, and on my random data these look-ups have been returning multiple rows. However, if you only ever need a single row, you could do:

    %%timeit
    startrow = dfdicts2[datetime(2022, 2, 16).date()].get_column('entityId').search_sorted(34, side='left')
    dfdicts2[datetime(2022, 2, 16).date()].slice(startrow, 1)
    333 µs ± 8.31 µs per loop
There's a slight improvement from the `search_sorted` trick over the regular filter, but not as big as [here](https://stackoverflow.com/questions/75169981/convert-huge-polars-dataframe-to-dict-without-consuming-too-much-ram/75177382#75177382). I'm guessing `filter` has become much more efficient since that answer was posted, which is one reason the gap is smaller; in this case the gain may simply come from returning only one row.
**TL;DR**

If you have enough memory, partition by one of the columns; try both ways, since the partition sizes will determine which performs better. The `search_sorted` trick only works on numbers, not dates, so it is only available if you partition by the date field. If you don't have the memory for `partition_by`, then combining the columns into a sorted struct and keeping both filter expressions in a single call to `filter` should be the next best option, and you can easily `unnest('INDEX')` to get the original shape back.
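A minimal sketch of how these pieces could be wrapped into a lookup helper (assuming the original `entityId` / `entryDate` column layout from the question, that the frame fits in memory, and that each key pair is unique, as stated in the question):

    # Build the "index" once, up front: one sub-frame per entryDate,
    # each sorted by entityId so that search_sorted is valid.
    partitions = df.sort(['entryDate', 'entityId']).partition_by('entryDate', as_dict=True)

    def locate(entityId, entryDate) -> pl.DataFrame:
        part = partitions[entryDate]  # O(1) dict lookup on the date
        row = part.get_column('entityId').search_sorted(entityId, side='left')
        return part.slice(row, 1)     # single-row DataFrame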
**As an aside**

If you can refactor whatever you're doing downstream into a `groupby.agg` and/or a join, that would probably be much better, in particular because of the parallelization.
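For instance, if the look-ups arrive in batches, they could be expressed as a single join rather than many individual filters (a sketch; `lookups` is a hypothetical frame of requested keys, and `df` again has the original column layout):

    # Hypothetical batch of keys to look up
    lookups = pl.DataFrame({
        'entityId': [34, 57],
        'entryDate': [datetime(2022, 2, 16).date(), datetime(2022, 3, 1).date()],
    })

    # One parallelized join replaces many separate filter calls
    result = lookups.join(df, on=['entityId', 'entryDate'], how='inner')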
  51. <details>
  52. <summary>英文:</summary>
  53. I&#39;m going to start with
  54. size=1000000
  55. df=pl.DataFrame({&#39;entityId&#39;:np.random.randint(0,200,size=size),
  56. &#39;entryDate&#39;:[datetime(2022,1,1).date() + timedelta(days=int(np.random.randint(1,200,1)[0])) for _ in range(size)],
  57. &#39;blah&#39;:np.random.random(size=size)})
  58. %%timeit
  59. df.filter(pl.col(&#39;entityId&#39;)==34).filter(pl.col(&#39;entryDate&#39;)==datetime(2022,2,16).date())
  60. 1.45 ms &#177; 14 &#181;s per loop
  61. That&#39;s the benchmark.
  62. The first improvement you can make is to combine your filters instead of chaining them...
  63. %%timeit
  64. df.filter((pl.col(&#39;entityId&#39;)==34) &amp; (pl.col(&#39;entryDate&#39;)==datetime(2022,2,16).date()))
  65. 996 &#181;s &#177; 16.6 &#181;s per loop
  66. Now let&#39;s try to sort the columns...
  67. df=df.sort([&#39;entityId&#39;, &#39;entryDate&#39;])
  68. After sorting...
  69. %%timeit
  70. df.filter(pl.col(&#39;entityId&#39;)==34).filter(pl.col(&#39;entryDate&#39;)==datetime(2022,2,16).date())
  71. 1.33 ms &#177; 23.7 &#181;s per loop
  72. barely an improvement over the initial 1.45ms
  73. %%timeit
  74. df.filter((pl.col(&#39;entityId&#39;)==34) &amp; (pl.col(&#39;entryDate&#39;)==datetime(2022,2,16).date()))
  75. 1.04 ms &#177; 35.5 &#181;s per loop
  76. not sure why that took longer on average although they overlap in 1 std dev so basically the same.
  77. now let&#39;s try making a struct column out of the two columns which we&#39;ll call INDEX
  78. df=df.select(pl.struct([&#39;entityId&#39;, &#39;entryDate&#39;]).alias(&#39;INDEX&#39;), pl.exclude([&#39;entityId&#39;, &#39;entryDate&#39;])).sort(&#39;INDEX&#39;)
  79. %%timeit
  80. df.filter((pl.col(&#39;INDEX&#39;).struct.field(&#39;entityId&#39;)==34) &amp; (pl.col(&#39;INDEX&#39;).struct.field(&#39;entryDate&#39;)==datetime(2022,2,16).date()))
  81. 946 &#181;s &#177; 18.4 &#181;s per loop
  82. That&#39;s a slight improvement.
  83. Let&#39;s try partitioning the df by `entityId`
  84. dfdicts=df.unnest(&#39;INDEX&#39;).partition_by(&#39;entityId&#39;,as_dict=True)
  85. %%timeit
  86. dfdicts[34].filter(pl.col(&#39;entryDate&#39;)==datetime(2022,2,16).date())
  87. 445 &#181;s &#177; 6.51 &#181;s per loop
  88. That&#39;s a big improvement!! Let&#39;s try partitioning the other way
  89. dfdicts2=df.unnest(&#39;INDEX&#39;).partition_by(&#39;entryDate&#39;,as_dict=True)
  90. %%timeit
  91. dfdicts2[datetime(2022,2,16).date()].filter(pl.col(&#39;entityId&#39;)==34)
  92. 382 &#181;s &#177; 7.23 &#181;s per loop
  93. The `search_sorted` trick only works when you want to get a single row and in my random data it&#39;s been returning multiple rows. However if you&#39;re only looking to get a single row you could do
  94. %%timeit
  95. startrow=dfdicts2[datetime(2022,2,16).date()].get_column(&#39;entityId&#39;).search_sorted(34, side=&#39;left&#39;)
  96. dfdicts2[datetime(2022,2,16).date()].slice(startrow, 1)
  97. 333 &#181;s &#177; 8.31 &#181;s per loop
  98. There&#39;s a slight improvement in the `search_sorted` trick against the regular filter but not as big as [here](https://stackoverflow.com/questions/75169981/convert-huge-polars-dataframe-to-dict-without-consuming-too-much-ram/75177382#75177382)
  99. I&#39;m guessing that since that answer was posted that `filter` was made to be much more efficient so that&#39;s one reason why there&#39;s not much of an improvement. In this case the improvement might just be because it&#39;s only returning one row.
  100. **TL;DR**
  101. If you&#39;ve got enough memory then partition by one of the columns, try both since the size will matter on which will give better performance. The `search_sorted` trick will only work on numbers, not dates, so it&#39;ll only be available if you partition by the date field. If you don&#39;t have the memory for the partition_by then combining your columns into a sorted struct plus keeping both filter expressions in one call to filter then that should be the next best. You can easily `unnest(&#39;INDEX&#39;)` to get the original shape back.
  102. **As an aside**
  103. If you can refactor whatever you&#39;re doing downstream into a groupby.agg and/or join then that would probably be much better in particular because of parallelization.
  104. </details>
