What is the fastest way to do "indexed" look-ups in Polars?
Question
I am working with large Polars dataframes which are fully loaded in memory. Each row is uniquely indexed by the columns entityId (Int64) and entryDate (date).

I know Polars does not have indexes, but I still need to do ad-hoc look-ups of data against these tables, and they are frequent enough that they account for a non-trivial share of my application's runtime.

Currently I am finding these rows using `.filter`:

    def locate(df, entityId, entryDate) -> pl.DataFrame:
        return df.filter(pl.col('entityId') == entityId).filter(pl.col('entryDate') == entryDate)

This is fairly fast, but it still takes 50 to 100 ms to look up a row.

Are there any optimizations I'm missing?

Some things I tried:

- using `.lazy`/`.collect` (no change)
- sorting by entityId (no change)

I'm on Polars 0.17.12.
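For reference, the `.lazy`/`.collect` attempt was the same pair of filters wrapped in a lazy query; a minimal sketch of what that looks like (the question does not give the exact code):

    import polars as pl

    # Minimal sketch of the .lazy/.collect variant mentioned above (not the
    # question's exact code): build the same chained filters as a lazy query
    # and materialize the result with .collect().
    def locate_lazy(df: pl.DataFrame, entityId, entryDate) -> pl.DataFrame:
        return (
            df.lazy()
            .filter(pl.col('entityId') == entityId)
            .filter(pl.col('entryDate') == entryDate)
            .collect()
        )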
Answer 1

Score: 2
I'm going to start with:

    import numpy as np
    import polars as pl
    from datetime import datetime, timedelta

    size = 1000000
    df = pl.DataFrame({'entityId': np.random.randint(0, 200, size=size),
                       'entryDate': [datetime(2022, 1, 1).date() + timedelta(days=int(np.random.randint(1, 200, 1)[0])) for _ in range(size)],
                       'blah': np.random.random(size=size)})

    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())

1.45 ms ± 14 µs per loop

That's the benchmark.
The first improvement you can make is to combine the filters instead of chaining them...

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))

996 µs ± 16.6 µs per loop
Now let's try sorting the columns...

    df = df.sort(['entityId', 'entryDate'])

After sorting...

    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())

1.33 ms ± 23.7 µs per loop

barely an improvement over the initial 1.45 ms

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))

1.04 ms ± 35.5 µs per loop

Not sure why that took longer on average, although the runs overlap within one standard deviation, so it's basically the same.
Now let's try making a struct column out of the two columns, which we'll call INDEX...

    df = df.select(pl.struct(['entityId', 'entryDate']).alias('INDEX'), pl.exclude(['entityId', 'entryDate'])).sort('INDEX')

    %%timeit
    df.filter((pl.col('INDEX').struct.field('entityId')==34) & (pl.col('INDEX').struct.field('entryDate')==datetime(2022,2,16).date()))

946 µs ± 18.4 µs per loop

That's a slight improvement.
Let's try partitioning the df by `entityId`...

    dfdicts = df.unnest('INDEX').partition_by('entityId', as_dict=True)

    %%timeit
    dfdicts[34].filter(pl.col('entryDate')==datetime(2022,2,16).date())

445 µs ± 6.51 µs per loop

That's a big improvement! Now let's try partitioning the other way...

    dfdicts2 = df.unnest('INDEX').partition_by('entryDate', as_dict=True)

    %%timeit
    dfdicts2[datetime(2022,2,16).date()].filter(pl.col('entityId')==34)

382 µs ± 7.23 µs per loop
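To fold this into the shape of the question's `locate` helper, the partitioning only pays off if the dictionary is built once up front and reused across look-ups. A rough sketch under that assumption (the helper name `locate_partitioned` is mine, not from the answer; `parts` is the same dictionary as `dfdicts2` above):

    # Rough sketch (helper name is mine): build the partition dictionary once,
    # then each look-up only filters the much smaller per-entryDate frame.
    parts = df.unnest('INDEX').partition_by('entryDate', as_dict=True)

    def locate_partitioned(entityId, entryDate) -> pl.DataFrame:
        # A KeyError here means there are no rows at all for that entryDate.
        return parts[entryDate].filter(pl.col('entityId') == entityId)

    locate_partitioned(34, datetime(2022, 2, 16).date())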
The `search_sorted` trick only works when you want to get a single row, and in my random data it has been returning multiple rows. However, if you're only looking to get a single row you could do:

    %%timeit
    startrow = dfdicts2[datetime(2022,2,16).date()].get_column('entityId').search_sorted(34, side='left')
    dfdicts2[datetime(2022,2,16).date()].slice(startrow, 1)

333 µs ± 8.31 µs per loop

There's a slight improvement from the `search_sorted` trick over the regular filter, but not as big as [here](https://stackoverflow.com/questions/75169981/convert-huge-polars-dataframe-to-dict-without-consuming-too-much-ram/75177382#75177382).

I'm guessing that `filter` has been made much more efficient since that answer was posted, which is one reason there isn't much of an improvement. In this case the improvement might just be because only one row is returned.
**TL;DR**

If you have enough memory, partition by one of the columns; try both, since the partition sizes will affect which gives better performance. The `search_sorted` trick only works on numbers, not dates, so it is only available if you partition by the date field. If you don't have the memory for `partition_by`, then combining the columns into a sorted struct, plus keeping both filter expressions in a single call to `filter`, should be the next best option. You can easily `unnest('INDEX')` to get the original shape back.
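Pulling those recommendations together, a rough consolidation (the helper name `locate_one` is mine; it continues the session above and assumes the requested (entityId, entryDate) pair exists and is unique, as in the question):

    # Rough consolidation (helper name is mine): sort by entityId so that
    # search_sorted is valid inside each per-date partition, then fetch a
    # single row with search_sorted + slice.
    parts = df.unnest('INDEX').sort('entityId').partition_by('entryDate', as_dict=True)

    def locate_one(entityId, entryDate) -> pl.DataFrame:
        part = parts[entryDate]
        row = part.get_column('entityId').search_sorted(entityId, side='left')
        return part.slice(row, 1)  # assumes the pair exists and is unique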
**As an aside**

If you can refactor whatever you're doing downstream into a `groupby.agg` and/or a join, that would probably be much better, in particular because of parallelization.
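For instance, if the pending look-ups can be collected into a batch, one join replaces many per-row filters and parallelizes inside Polars. A sketch under that assumption (the `wanted` frame of look-up keys is hypothetical, not from the answer, and the session above is assumed):

    # Hypothetical batch of look-up keys; a single join against the table
    # replaces many individual locate() calls.
    wanted = pl.DataFrame({
        'entityId': [34, 57],
        'entryDate': [datetime(2022, 2, 16).date(), datetime(2022, 3, 1).date()],
    })

    result = wanted.join(df.unnest('INDEX'), on=['entityId', 'entryDate'], how='left')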