February 8, 2023, 13:12:40

How to imitate Pandas' index-based querying in Polars?

Question

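# Imitate the Pandas code below with Polars while keeping the requested order.
# Note: the 0/1 flag only distinguishes the first requested value, so this
# works for two values, not for an arbitrary list.
import polars as pl

df = pl.DataFrame(data=([21, 123], [132, 412], [23, 43]), schema=['c1', 'c2'], orient='row')
rows_to_select = [23, 132]
selected_rows = df.filter(pl.col('c1').is_in(rows_to_select))
# Flag the first requested value with 0 so an ascending sort puts it first.
selected_rows = selected_rows.with_columns(pl.when(pl.col('c1') == rows_to_select[0]).then(0).otherwise(1).alias('order'))
selected_rows = selected_rows.sort('order').drop('order')
print(selected_rows)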

Any idea what I can do to imitate the pandas code below using Polars? Polars doesn't have indexes like pandas, so I couldn't figure out how to do it.

import pandas as pd

df = pd.DataFrame(data=([21, 123], [132, 412], [23, 43]), columns=['c1', 'c2']).set_index("c1")
print(df.loc[[23, 132]])

and it prints

c1	c2
23	43
132	412

The only Polars conversion I could figure out is

df = pl.DataFrame(data=([21, 123], [132, 412], [23, 43]), schema=['c1', 'c2'])
print(df.filter(pl.col("c1").is_in([23, 132])))

but it prints

c1	c2
132	412
23	43

which is okay, but the rows are not in the order I gave. I gave [23, 132] and want the output rows to be in the same order, as in the pandas output.

I can use a sort() later, yes, but the original data I use this on has about 30 million rows, so I'm looking for something that's as fast as possible.

Answer 1

Score: 6

I suggest using a left join to accomplish this. This will maintain the order corresponding to your list of index values. (And it is quite performant.)

For example, let's start with this shuffled DataFrame.

import polars as pl

nbr_rows = 30_000_000
df = pl.DataFrame({
    'c1': pl.arange(0, nbr_rows, eager=True).shuffle(2),
    'c2': pl.arange(0, nbr_rows, eager=True).shuffle(3),
})
df

shape: (30000000, 2)
┌──────────┬──────────┐
│ c1       ┆ c2       │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 4052015  ┆ 20642741 │
│ 7787054  ┆ 17007051 │
│ 20246150 ┆ 19445431 │
│ 1309992  ┆ 6495751  │
│ ...      ┆ ...      │
│ 10371090 ┆ 4791782  │
│ 26281644 ┆ 12350777 │
│ 6740626  ┆ 24888572 │
│ 22573405 ┆ 14885989 │
└──────────┴──────────┘

And these index values:

nbr_index_values = 10_000
s1 = pl.Series(name='c1', values=pl.arange(0, nbr_index_values, eager=True).shuffle())
s1

shape: (10000,)
Series: 'c1' [i64]
[
        1754
        6716
        3485
        7058
        7216
        1040
        1832
        3921
        1639
        6734
        5560
        7596
        ...
        4243
        4455
        894
        7806
        9291
        1883
        9947
        3309
        2030
        7731
        4706
        8528
        8426
]

We now perform a left join to obtain the rows corresponding to the index values. (Note that the list of index values is the left DataFrame in this join.)

import time

start = time.perf_counter()
df2 = (
    s1.to_frame()
    .join(
        df,
        on='c1',
        how='left'
    )
)
print(time.perf_counter() - start)
df2

>>> print(time.perf_counter() - start)
0.8427023889998964

shape: (10000, 2)
┌──────┬──────────┐
│ c1   ┆ c2       │
│ ---  ┆ ---      │
│ i64  ┆ i64      │
╞══════╪══════════╡
│ 1754 ┆ 15734441 │
│ 6716 ┆ 20631535 │
│ 3485 ┆ 20199121 │
│ 7058 ┆ 15881128 │
│ ...  ┆ ...      │
│ 7731 ┆ 19420197 │
│ 4706 ┆ 16918008 │
│ 8528 ┆ 5278904  │
│ 8426 ┆ 18927935 │
└──────┴──────────┘

Notice how the rows are in the same order as the index values. We can verify this:

s1.series_equal(df2.get_column('c1'), strict=True)

>>> s1.series_equal(df2.get_column('c1'), strict=True)
True
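
One way to picture why the left join returns rows in the order of the index values: a hash-based left join can index the large table by key once and then look up each requested key in turn. The sketch below is plain Python for illustration only. It is not Polars' actual implementation, the helper name `left_join_in_key_order` is hypothetical, and it assumes the keys in the large table are unique.

```python
def left_join_in_key_order(keys, rows, key_index=0):
    """Return one row per requested key, in the order the keys were given.

    Hypothetical helper for illustration; assumes keys in `rows` are unique.
    """
    # Build phase: index the large table once by its key column.
    lookup = {row[key_index]: row for row in rows}
    # Probe phase: walk the keys in request order; missing keys map to None.
    return [lookup.get(k) for k in keys]


rows = [(21, 123), (132, 412), (23, 43)]
selected = left_join_in_key_order([23, 132], rows)
print(selected)  # [(23, 43), (132, 412)]
```

Because the output is produced while walking the requested keys, its order follows the request and not the order of the large table.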

And the performance is quite good. On my 32-core system, this takes less than a second.
