2023年7月20日 11:38:42go评论111阅读模式

英文:

How to take head(n) and tail(n) in one groupby with polars

问题

这里是一个示例数据框，我想要在每一天中使用groupby("date").agg()获取前n行和后n行。我知道可以使用两个groupby来获取前部分和后部分，然后将它们连接起来。是否有一种在一个groupby或with_columns或filter中执行此操作的技巧？

英文:

Here is a sample dataframe, I want to get head(n) rows and tail(n) rows in everyday which is a groupby("date").agg() with polars. I know it could use two groupby to get head and tail part and then concat them. Is there some trick to do this in one groupby or with_columns or filter?

答案1

得分: 3

这是一个通用的方法，用于获取每个组的 head(n) 和 tail(n) 并将其合并到最终的DataFrame中，无需使用 concat 操作，示例中使用一个简单的DataFrame。

它使用 filter 来从每个组中移除中间的行，使用 cumcount 和 count 进行适当的计算，使用 over 在原地进行分组聚合：

N = 50
df = pl.DataFrame(
    {
        "group": [1] * N + [2] * N,
        "value": range(1, 2 * N + 1),
    }
)
n = 2
df.filter(
    ~pl.col('group')
    .cumcount()
    .over('group')
    .is_between(n, pl.col('group').count().over('group') - n, closed='left')
)

shape: (8, 2)
┌───────┬───────┐
│ group │ value │
│ ---   │ ---   │
│ i64   │ i64   │
╞═══════╪═══════╡
│ 1     │ 1     │
│ 1     │ 2     │
│ 1     │ 49    │
│ 1     │ 50    │
│ 2     │ 51    │
│ 2     │ 52    │
│ 2     │ 99    │
│ 2     │ 100   │
└───────┴───────┘

您可以调整 n 的值，看到它适用于所有非负的 n。

英文:

Here's a general way to get the head(n) and tail(n) per group into a final DataFrame, without concat shenanigans, and using a trivial df as an example.

It uses filter to take out the middle rows over each group, with appropriate calculations being done with cumcount and count, with over doing the groupby-aggregations in place:

N = 50
df = pl.DataFrame(
    {
        &quot;group&quot;: [1] * N + [2] * N,
        &quot;value&quot;: range(1, 2 * N + 1),
    }
)
n = 2
df.filter(
    ~pl.col(&#39;group&#39;)
    .cumcount()
    .over(&#39;group&#39;)
    .is_between(n, pl.col(&#39;group&#39;).count().over(&#39;group&#39;) - n, closed=&#39;left&#39;)
)

shape: (8, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 1     ┆ 1     │
│ 1     ┆ 2     │
│ 1     ┆ 49    │
│ 1     ┆ 50    │
│ 2     ┆ 51    │
│ 2     ┆ 52    │
│ 2     ┆ 99    │
│ 2     ┆ 100   │
└───────┴───────┘

You can adjust n and see that it works for all non-negative n.

答案2

得分: 1

@Wayoshi 已经提交了一个答案，你已经接受了，但是在你希望结果的格式方面存在一些不确定性。因此，在这里，我提交了几种不同的解决方案形式，与@Wayoshi的示例数据类似，这可能对其他人有用。

创建示例数据：

import numpy as np
import polars as pl
groups = 4
group_size = 10
df = pl.DataFrame(
    {
        &quot;group&quot;: np.repeat(range(groups), group_size),
        &quot;value_a&quot;: range(groups * group_size),
        &quot;value_b&quot;: range(0, groups * group_size ** 2, group_size),
    }
)
df
# shape: (40, 3)
# ┌───────┬─────────┬─────────┐
# │ group ┆ value_a ┆ value_b │
# │ ---   ┆ ---     ┆ ---     │
# │ i64   ┆ i64     ┆ i64     │
# ╞═══════╪═════════╪═════════╡
# │ 0     ┆ 0       ┆ 0       │
# │ 0     ┆ 1       ┆ 10      │
# │ 0     ┆ 2       ┆ 20      │
# │ 0     ┆ 3       ┆ 30      │
# │ …     ┆ …       ┆ …       │
# │ 3     ┆ 36      ┆ 360     │
# │ 3     ┆ 37      ┆ 370     │
# │ 3     ┆ 38      ┆ 380     │
# │ 3     ┆ 39      ┆ 390     │
# └───────┴─────────┴─────────┘

要为每个组和每个值创建单独的 head 和 tail 列，其中每个 head 和 tail 都是值的列表：

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.col(&quot;*&quot;).head(n).suffix(&quot;_head&quot;),
        pl.col(&quot;*&quot;).tail(n).suffix(&quot;_tail&quot;),
    )
)
# shape: (4, 5)
# ┌───────┬──────────────┬─────────────────┬──────────────┬─────────────────┐
# │ group ┆ value_a_head ┆ value_b_head    ┆ value_a_tail ┆ value_b_tail    │
# │ ---   ┆ ---          ┆ ---             ┆ ---          ┆ ---             │
# │ i64   ┆ list[i64]    ┆ list[i64]       ┆ list[i64]    ┆ list[i64]       │
# ╞═══════╪══════════════╪═════════════════╪══════════════╪═════════════════╡
# │ 0     ┆ [0, 1, 2]    ┆ [0, 10, 20]     ┆ [7, 8, 9]    ┆ [70, 80, 90]    │
# │ 1     ┆ [10, 11, 12] ┆ [100, 110, 120] ┆ [17, 18, 19] ┆ [170, 180, 190] │
# │ 2     ┆ [20, 21, 22] ┆ [200, 210, 220] ┆ [27, 28, 29] ┆ [270, 280, 290] │
# │ 3     ┆ [30, 31, 32] ┆ [300, 310, 320] ┆ [37, 38, 39] ┆ [370, 380, 390] │
# └───────┴──────────────┴─────────────────┴──────────────┴─────────────────┘

要为每个组和每个值创建单独的 head 和 tail 列，其中每个 head 和 tail 都是 "un-listed"（这种格式有用吗？我不确定...）：

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.col(&quot;*&quot;).head(n).suffix(&quot;_head&quot;),
        pl.col(&quot;*&quot;).tail(n).suffix(&quot;_tail&quot;),
    )
    .explode(pl.exclude(&quot;group&quot;))
)
# shape: (12, 5)
# ┌───────┬──────────────┬──────────────┬──────────────┬──────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ ---   ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
# │ i64   ┆ i64          ┆ i64          ┆ i64          ┆ i64          │
# ╞═══════╪══════════════╪══════════════╪══════════════╪══════════════╡
# │ 0     ┆ 0            ┆ 0            ┆ 7            ┆ 70           │
# │ 0     ┆ 1            ┆ 10           ┆ 8            ┆ 80           │
# │ 0     ┆ 2            ┆ 20           ┆ 9            ┆ 90           │
# │ 1     ┆ 10           ┆ 100          ┆ 17           ┆ 170          │
# │ …     ┆ …            ┆ …            ┆ …            ┆ …            │
# │ 2     ┆ 22           ┆ 220          ┆ 29           ┆ 290          │
# │ 3     ┆ 30           ┆ 300          ┆ 37           ┆ 370          │
# │ 3     ┆ 31           ┆ 310          ┆ 38           ┆ 380          │
# │ 3     ┆ 32           ┆ 320          ┆ 39           ┆ 390          │
# └──────
<details>
<summary>英文:</summary>
[@Wayoshi](https://stackoverflow.com/users/21191420/wayoshi) has already submitted an answer which you&#39;ve accepted, however, there&#39;s some ambiguity in the format which you&#39;d like the results to be in. So here I submit a couple of different solution forms to the same problem, inspired by @Wayoshi&#39;s example data, which might be useful to others
Creating example data:
```python
import numpy as np
import polars as pl
groups = 4
group_size = 10
df = pl.DataFrame(
    {
        &quot;group&quot;: np.repeat(range(groups), group_size),
        &quot;value_a&quot;: range(groups * group_size),
        &quot;value_b&quot;: range(0, groups * group_size ** 2, group_size),
    }
)
df
# shape: (40, 3)
# ┌───────┬─────────┬─────────┐
# │ group ┆ value_a ┆ value_b │
# │ ---   ┆ ---     ┆ ---     │
# │ i64   ┆ i64     ┆ i64     │
# ╞═══════╪═════════╪═════════╡
# │ 0     ┆ 0       ┆ 0       │
# │ 0     ┆ 1       ┆ 10      │
# │ 0     ┆ 2       ┆ 20      │
# │ 0     ┆ 3       ┆ 30      │
# │ …     ┆ …       ┆ …       │
# │ 3     ┆ 36      ┆ 360     │
# │ 3     ┆ 37      ┆ 370     │
# │ 3     ┆ 38      ┆ 380     │
# │ 3     ┆ 39      ┆ 390     │
# └───────┴─────────┴─────────┘

To create separate head and tail columns for each group, and each value, where each head and tail is a list of values:

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.col(&quot;*&quot;).head(n).suffix(&quot;_head&quot;),
        pl.col(&quot;*&quot;).tail(n).suffix(&quot;_tail&quot;),
    )
)
# shape: (4, 5)
# ┌───────┬──────────────┬─────────────────┬──────────────┬─────────────────┐
# │ group ┆ value_a_head ┆ value_b_head    ┆ value_a_tail ┆ value_b_tail    │
# │ ---   ┆ ---          ┆ ---             ┆ ---          ┆ ---             │
# │ i64   ┆ list[i64]    ┆ list[i64]       ┆ list[i64]    ┆ list[i64]       │
# ╞═══════╪══════════════╪═════════════════╪══════════════╪═════════════════╡
# │ 0     ┆ [0, 1, 2]    ┆ [0, 10, 20]     ┆ [7, 8, 9]    ┆ [70, 80, 90]    │
# │ 1     ┆ [10, 11, 12] ┆ [100, 110, 120] ┆ [17, 18, 19] ┆ [170, 180, 190] │
# │ 2     ┆ [20, 21, 22] ┆ [200, 210, 220] ┆ [27, 28, 29] ┆ [270, 280, 290] │
# │ 3     ┆ [30, 31, 32] ┆ [300, 310, 320] ┆ [37, 38, 39] ┆ [370, 380, 390] │
# └───────┴──────────────┴─────────────────┴──────────────┴─────────────────┘

To create separate head and tail columns for each group, and each value,
where each head and tail is "un-listed" (is this format useful? I'm not
sure...)

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.col(&quot;*&quot;).head(n).suffix(&quot;_head&quot;),
        pl.col(&quot;*&quot;).tail(n).suffix(&quot;_tail&quot;),
    )
    .explode(pl.exclude(&quot;group&quot;))
)
# shape: (12, 5)
# ┌───────┬──────────────┬──────────────┬──────────────┬──────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ ---   ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
# │ i64   ┆ i64          ┆ i64          ┆ i64          ┆ i64          │
# ╞═══════╪══════════════╪══════════════╪══════════════╪══════════════╡
# │ 0     ┆ 0            ┆ 0            ┆ 7            ┆ 70           │
# │ 0     ┆ 1            ┆ 10           ┆ 8            ┆ 80           │
# │ 0     ┆ 2            ┆ 20           ┆ 9            ┆ 90           │
# │ 1     ┆ 10           ┆ 100          ┆ 17           ┆ 170          │
# │ …     ┆ …            ┆ …            ┆ …            ┆ …            │
# │ 2     ┆ 22           ┆ 220          ┆ 29           ┆ 290          │
# │ 3     ┆ 30           ┆ 300          ┆ 37           ┆ 370          │
# │ 3     ┆ 31           ┆ 310          ┆ 38           ┆ 380          │
# │ 3     ┆ 32           ┆ 320          ┆ 39           ┆ 390          │
# └───────┴──────────────┴──────────────┴──────────────┴──────────────┘

To create a combined head_tail column for each group, where each head and tail is a list of values (again, is the format "useful"? probably not... but that's for someone else to decide):

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.col(&quot;*&quot;).head(n).append(pl.col(&quot;*&quot;).tail(n)).suffix(&quot;_head_tail&quot;),
     )
)
# shape: (4, 3)
# ┌───────┬───────────────────┬───────────────────┐
# │ group ┆ value_a_head_tail ┆ value_b_head_tail │
# │ ---   ┆ ---               ┆ ---               │
# │ i64   ┆ list[i64]         ┆ list[i64]         │
# ╞═══════╪═══════════════════╪═══════════════════╡
# │ 0     ┆ [0, 1, … 9]       ┆ [0, 10, … 90]     │
# │ 1     ┆ [10, 11, … 19]    ┆ [100, 110, … 190] │
# │ 2     ┆ [20, 21, … 29]    ┆ [200, 210, … 290] │
# │ 3     ┆ [30, 31, … 39]    ┆ [300, 310, … 390] │
# └───────┴───────────────────┴───────────────────┘

I'm not convinced that the operation would ever be useful, but who am I to say. If you wanted to head and tail all your value columns into a single column, stored as a list:

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.concat([pl.exclude(&quot;group&quot;).head(n), pl.exclude(&quot;group&quot;).tail(n)]).alias(&quot;all_head_tail&quot;)
    )
)
# shape: (4, 2)
# ┌───────┬─────────────────┐
# │ group ┆ all_head_tail   │
# │ ---   ┆ ---             │
# │ i64   ┆ list[i64]       │
# ╞═══════╪═════════════════╡
# │ 0     ┆ [0, 1, … 90]    │
# │ 1     ┆ [10, 11, … 190] │
# │ 2     ┆ [20, 21, … 290] │
# │ 3     ┆ [30, 31, … 390] │
# └───────┴─────────────────┘

And finally, if you wanted to head and tail all your value columns into a single
column, and unlist the values (the format which Wayoshi offered), an
alternative way to achieve this would be:

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.concat([pl.exclude(&quot;group&quot;).head(n), pl.exclude(&quot;group&quot;).tail(n)]).alias(&quot;all_head_tail&quot;)
    )
    .explode(&quot;all_head_tail&quot;)
)
# shape: (48, 2)
# ┌───────┬───────────────┐
# │ group ┆ all_head_tail │
# │ ---   ┆ ---           │
# │ i64   ┆ i64           │
# ╞═══════╪═══════════════╡
# │ 0     ┆ 0             │
# │ 0     ┆ 1             │
# │ 0     ┆ 2             │
# │ 0     ┆ 0             │
# │ …     ┆ …             │
# │ 3     ┆ 39            │
# │ 3     ┆ 370           │
# │ 3     ┆ 380           │
# │ 3     ┆ 390           │
# └───────┴───────────────┘

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

How to take head(n) and tail(n) in one groupby with polars

问题

答案1

答案2

df.str.get_dummies() 与 pd.get_dummies() (Python)

How can I arrange a dictionary using python so that the values are represented from lowest to highest order?

两个不同形状的NumPy数组的逐元素相乘

Fein调整一个已经训练好的XGBoost分类模型

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。