What is the fastest way to do "indexed" look-ups in Polars?

Question

I am working with large Polars DataFrames which are fully loaded in memory. Each row is uniquely indexed by the columns entityId (Int64) and entryDate (date).

I know Polars does not have indexes, but I still need to do ad-hoc look-ups of data against these tables, and they are frequent enough to take a non-trivial % of my application's runtime.

Currently I am finding these rows by using .filter:

    def locate(df, entityId, entryDate) -> pl.DataFrame:
        return df.filter(pl.col('entityId') == entityId).filter(pl.col('entryDate') == entryDate)

This is fairly fast, but it still takes 50 to 100 ms to look up a row.

Are there any optimizations I'm missing?

Some things I tried (a sketch of the lazy variant follows the list):

  1. using .lazy / .collect (no change)
  2. sorting by entityId (no change)
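
Roughly, the lazy attempt looked like the sketch below (`locate_lazy` is an illustrative name, not my exact code):

    import polars as pl

    def locate_lazy(df: pl.DataFrame, entityId, entryDate) -> pl.DataFrame:
        # Same predicate as locate, but routed through the lazy engine so the
        # optimizer can fuse both comparisons into a single scan.
        return (
            df.lazy()
              .filter((pl.col('entityId') == entityId) & (pl.col('entryDate') == entryDate))
              .collect()
        )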

I'm on Polars 0.17.12.

Answer 1

Score: 2

I'm going to start with


    import numpy as np
    import polars as pl
    from datetime import datetime, timedelta

    size = 1000000
    df = pl.DataFrame({'entityId': np.random.randint(0, 200, size=size),
                       'entryDate': [datetime(2022, 1, 1).date() + timedelta(days=int(np.random.randint(1, 200, 1)[0])) for _ in range(size)],
                       'blah': np.random.random(size=size)})

    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())

    1.45 ms ± 14 µs per loop

That's the benchmark.

The first improvement you can make is to combine your filters instead of chaining them...

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))

    996 µs ± 16.6 µs per loop

Now let's try to sort the columns...

    df=df.sort(['entityId', 'entryDate'])

After sorting...

    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())

    1.33 ms ± 23.7 µs per loop

Barely an improvement over the initial 1.45 ms.

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))

    1.04 ms ± 35.5 µs per loop

Not sure why that took longer on average, although the two overlap within one standard deviation, so they're basically the same.

Now let's try making a struct column, which we'll call INDEX, out of the two columns:

    df=df.select(pl.struct(['entityId', 'entryDate']).alias('INDEX'), pl.exclude(['entityId', 'entryDate'])).sort('INDEX')


    %%timeit
    df.filter((pl.col('INDEX').struct.field('entityId')==34) & (pl.col('INDEX').struct.field('entryDate')==datetime(2022,2,16).date()))

    946 µs ± 18.4 µs per loop

That's a slight improvement.

Let's try partitioning the df by `entityId`...

    dfdicts=df.unnest('INDEX').partition_by('entityId',as_dict=True)

    %%timeit
    dfdicts[34].filter(pl.col('entryDate')==datetime(2022,2,16).date())

    445 µs ± 6.51 µs per loop

That's a big improvement! Let's try partitioning the other way...

    dfdicts2=df.unnest('INDEX').partition_by('entryDate',as_dict=True)

    %%timeit
    dfdicts2[datetime(2022,2,16).date()].filter(pl.col('entityId')==34)

    382 µs ± 7.23 µs per loop

The `search_sorted` trick only works when you want to get a single row, and in my random data it's been returning multiple rows. However, if you're only looking to get a single row, you could do:

    %%timeit
    startrow=dfdicts2[datetime(2022,2,16).date()].get_column('entityId').search_sorted(34, side='left')
    dfdicts2[datetime(2022,2,16).date()].slice(startrow, 1)

    333 µs ± 8.31 µs per loop

There's a slight improvement with the `search_sorted` trick over the regular filter, but not as big as [here](https://stackoverflow.com/questions/75169981/convert-huge-polars-dataframe-to-dict-without-consuming-too-much-ram/75177382#75177382).

I'm guessing that `filter` has been made much more efficient since that answer was posted, which is one reason there's not much of an improvement here. In this case the improvement might just be because it's only returning one row.

**TL;DR**

If you've got enough memory, partition by one of the columns; try both, since partition size will affect which gives better performance. The `search_sorted` trick only works on numbers, not dates, so it's only available if you partition by the date field. If you don't have the memory for partition_by, then combining your columns into a sorted struct, and keeping both filter expressions in one call to `filter`, should be the next best option. You can easily `unnest('INDEX')` to get the original shape back.
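
As a concrete sketch of the partition-plus-`search_sorted` combination (`locate_fast` and `parts` are illustrative names, not a Polars API; this assumes `(entityId, entryDate)` uniquely identifies a row, as in the question):

    import polars as pl

    # Build the "index" once, up front: partition by entryDate, keeping each
    # partition sorted by entityId so search_sorted is valid on that column.
    parts = {
        key: part.sort('entityId')
        for key, part in df.unnest('INDEX').partition_by('entryDate', as_dict=True).items()
    }

    def locate_fast(entityId, entryDate) -> pl.DataFrame:
        # Binary search inside the (much smaller) partition instead of a
        # full-column scan over the original frame.
        part = parts[entryDate]
        row = part.get_column('entityId').search_sorted(entityId, side='left')
        return part.slice(row, 1)

A look-up is then `locate_fast(34, datetime(2022,2,16).date())`.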

**As an aside**
If you can refactor whatever you're doing downstream into a `groupby.agg` and/or a join, that would probably be much better, in particular because of parallelization.
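
For instance, if the look-ups can be batched, one inner join against a small frame of keys replaces many single-row filters (a sketch; `keys` is a hypothetical frame holding the `(entityId, entryDate)` pairs you need):

    import polars as pl
    from datetime import date

    # Hypothetical batch of look-up keys.
    keys = pl.DataFrame({
        'entityId': [34, 77],
        'entryDate': [date(2022, 2, 16), date(2022, 3, 1)],
    })

    # One parallel hash join instead of one filter call per key.
    rows = keys.join(df.unnest('INDEX'), on=['entityId', 'entryDate'], how='inner')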
