What is the fastest way to do "indexed" look-ups in Polars?
Question
I am working with large Polars dataframes which are fully loaded in memory. Each row is uniquely indexed by the columns entityId (Int64) and entryDate (date).

I know Polars does not have indexes, but I still need to do ad-hoc look-ups of data against these tables, and they are frequent enough that they account for a non-trivial share of my application's runtime.

Currently I am finding these rows using `.filter`:

    def locate(df, entityId, entryDate) -> pl.DataFrame:
        return df.filter(pl.col('entityId') == entityId).filter(pl.col('entryDate') == entryDate)

This is fairly fast, but it still takes 50 to 100 ms to look up a row.

Are there any optimizations I'm missing?

Some things I tried:

- using `.lazy`/`.collect` (no change)
- sorting by entityId (no change)

I'm on Polars 0.17.12.
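For reference, the `.lazy`/`.collect` attempt was the same pair of filters wrapped in a lazy query; a minimal sketch of what that looks like (the question does not give the exact code):

    import polars as pl

    # Minimal sketch of the .lazy/.collect variant mentioned above (not the
    # question's exact code): build the same chained filters as a lazy query
    # and materialize the result with .collect().
    def locate_lazy(df: pl.DataFrame, entityId, entryDate) -> pl.DataFrame:
        return (
            df.lazy()
            .filter(pl.col('entityId') == entityId)
            .filter(pl.col('entryDate') == entryDate)
            .collect()
        )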
Answer 1

Score: 2
I'm going to start with:

    import numpy as np
    import polars as pl
    from datetime import datetime, timedelta

    size = 1000000
    df = pl.DataFrame({'entityId': np.random.randint(0, 200, size=size),
                       'entryDate': [datetime(2022, 1, 1).date() + timedelta(days=int(np.random.randint(1, 200, 1)[0])) for _ in range(size)],
                       'blah': np.random.random(size=size)})

    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())

1.45 ms ± 14 µs per loop

That's the benchmark.
The first improvement you can make is to combine the filters instead of chaining them...

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))

996 µs ± 16.6 µs per loop
Now let's try sorting the columns...

    df = df.sort(['entityId', 'entryDate'])

After sorting...

    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())

1.33 ms ± 23.7 µs per loop

barely an improvement over the initial 1.45 ms

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))

1.04 ms ± 35.5 µs per loop

Not sure why that took longer on average, although the runs overlap within one standard deviation, so it's basically the same.
Now let's try making a struct column out of the two columns, which we'll call INDEX...

    df = df.select(pl.struct(['entityId', 'entryDate']).alias('INDEX'), pl.exclude(['entityId', 'entryDate'])).sort('INDEX')

    %%timeit
    df.filter((pl.col('INDEX').struct.field('entityId')==34) & (pl.col('INDEX').struct.field('entryDate')==datetime(2022,2,16).date()))

946 µs ± 18.4 µs per loop

That's a slight improvement.
Let's try partitioning the df by `entityId`...

    dfdicts = df.unnest('INDEX').partition_by('entityId', as_dict=True)

    %%timeit
    dfdicts[34].filter(pl.col('entryDate')==datetime(2022,2,16).date())

445 µs ± 6.51 µs per loop

That's a big improvement! Now let's try partitioning the other way...

    dfdicts2 = df.unnest('INDEX').partition_by('entryDate', as_dict=True)

    %%timeit
    dfdicts2[datetime(2022,2,16).date()].filter(pl.col('entityId')==34)

382 µs ± 7.23 µs per loop
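To fold this into the shape of the question's `locate` helper, the partitioning only pays off if the dictionary is built once up front and reused across look-ups. A rough sketch under that assumption (the helper name `locate_partitioned` is mine, not from the answer; `parts` is the same dictionary as `dfdicts2` above):

    # Rough sketch (helper name is mine): build the partition dictionary once,
    # then each look-up only filters the much smaller per-entryDate frame.
    parts = df.unnest('INDEX').partition_by('entryDate', as_dict=True)

    def locate_partitioned(entityId, entryDate) -> pl.DataFrame:
        # A KeyError here means there are no rows at all for that entryDate.
        return parts[entryDate].filter(pl.col('entityId') == entityId)

    locate_partitioned(34, datetime(2022, 2, 16).date())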
The `search_sorted` trick only works when you want to get a single row, and in my random data it has been returning multiple rows. However, if you're only looking to get a single row you could do:

    %%timeit
    startrow = dfdicts2[datetime(2022,2,16).date()].get_column('entityId').search_sorted(34, side='left')
    dfdicts2[datetime(2022,2,16).date()].slice(startrow, 1)

333 µs ± 8.31 µs per loop

There's a slight improvement from the `search_sorted` trick over the regular filter, but not as big as [here](https://stackoverflow.com/questions/75169981/convert-huge-polars-dataframe-to-dict-without-consuming-too-much-ram/75177382#75177382).

I'm guessing that `filter` has been made much more efficient since that answer was posted, which is one reason there isn't much of an improvement. In this case the improvement might just be because only one row is returned.
**TL;DR**

If you have enough memory, partition by one of the columns; try both, since the partition sizes will affect which gives better performance. The `search_sorted` trick only works on numbers, not dates, so it is only available if you partition by the date field. If you don't have the memory for `partition_by`, then combining the columns into a sorted struct, plus keeping both filter expressions in a single call to `filter`, should be the next best option. You can easily `unnest('INDEX')` to get the original shape back.
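Pulling those recommendations together, a rough consolidation (the helper name `locate_one` is mine; it continues the session above and assumes the requested (entityId, entryDate) pair exists and is unique, as in the question):

    # Rough consolidation (helper name is mine): sort by entityId so that
    # search_sorted is valid inside each per-date partition, then fetch a
    # single row with search_sorted + slice.
    parts = df.unnest('INDEX').sort('entityId').partition_by('entryDate', as_dict=True)

    def locate_one(entityId, entryDate) -> pl.DataFrame:
        part = parts[entryDate]
        row = part.get_column('entityId').search_sorted(entityId, side='left')
        return part.slice(row, 1)  # assumes the pair exists and is unique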
**As an aside**

If you can refactor whatever you're doing downstream into a `groupby.agg` and/or a join, that would probably be much better, in particular because of parallelization.
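For instance, if the pending look-ups can be collected into a batch, one join replaces many per-row filters and parallelizes inside Polars. A sketch under that assumption (the `wanted` frame of look-up keys is hypothetical, not from the answer, and the session above is assumed):

    # Hypothetical batch of look-up keys; a single join against the table
    # replaces many individual locate() calls.
    wanted = pl.DataFrame({
        'entityId': [34, 57],
        'entryDate': [datetime(2022, 2, 16).date(), datetime(2022, 3, 1).date()],
    })

    result = wanted.join(df.unnest('INDEX'), on=['entityId', 'entryDate'], how='left')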