How to take head(n) and tail(n) in one groupby with polars

huangapple go评论80阅读模式
英文:

How to take head(n) and tail(n) in one groupby with polars

问题

这里是一个示例数据框,我想要在每一天中使用groupby("date").agg()获取前n行和后n行。我知道可以使用两个groupby来获取前部分和后部分,然后将它们连接起来。是否有一种在一个groupbywith_columnsfilter中执行此操作的技巧?

英文:

Here is a sample dataframe, I want to get head(n) rows and tail(n) rows in everyday which is a groupby("date").agg() with polars. I know it could use two groupby to get head and tail part and then concat them. Is there some trick to do this in one groupby or with_columns or filter?

How to take head(n) and tail(n) in one groupby with polars

答案1

得分: 3

这是一个通用的方法,用于获取每个组的 head(n)tail(n) 并将其合并到最终的DataFrame中,无需使用 concat 操作,示例中使用一个简单的DataFrame。

它使用 filter 来从每个组中移除中间的行,使用 cumcountcount 进行适当的计算,使用 over 在原地进行分组聚合:

N = 50

df = pl.DataFrame(
    {
        "group": [1] * N + [2] * N,
        "value": range(1, 2 * N + 1),
    }
)

n = 2
df.filter(
    ~pl.col('group')
    .cumcount()
    .over('group')
    .is_between(n, pl.col('group').count().over('group') - n, closed='left')
)
shape: (8, 2)
┌───────┬───────┐
 group  value 
 ---    ---   
 i64    i64   
╞═══════╪═══════╡
 1      1     
 1      2     
 1      49    
 1      50    
 2      51    
 2      52    
 2      99    
 2      100   
└───────┴───────┘

您可以调整 n 的值,看到它适用于所有非负的 n

英文:

Here's a general way to get the head(n) and tail(n) per group into a final DataFrame, without concat shenanigans, and using a trivial df as an example.

It uses filter to take out the middle rows over each group, with appropriate calculations being done with cumcount and count, with over doing the groupby-aggregations in place:

N = 50

df = pl.DataFrame(
    {
        "group": [1] * N + [2] * N,
        "value": range(1, 2 * N + 1),
    }
)

n = 2
df.filter(
    ~pl.col('group')
    .cumcount()
    .over('group')
    .is_between(n, pl.col('group').count().over('group') - n, closed='left')
)
shape: (8, 2)
┌───────┬───────┐
 group  value 
 ---    ---   
 i64    i64   
╞═══════╪═══════╡
 1      1     
 1      2     
 1      49    
 1      50    
 2      51    
 2      52    
 2      99    
 2      100   
└───────┴───────┘

You can adjust n and see that it works for all non-negative n.

答案2

得分: 1

@Wayoshi 已经提交了一个答案,你已经接受了,但是在你希望结果的格式方面存在一些不确定性。因此,在这里,我提交了几种不同的解决方案形式,与@Wayoshi的示例数据类似,这可能对其他人有用。

创建示例数据:

import numpy as np
import polars as pl

groups = 4
group_size = 10
df = pl.DataFrame(
    {
        "group": np.repeat(range(groups), group_size),
        "value_a": range(groups * group_size),
        "value_b": range(0, groups * group_size ** 2, group_size),
    }
)
df
# shape: (40, 3)
# ┌───────┬─────────┬─────────┐
# │ group ┆ value_a ┆ value_b │
# │ ---   ┆ ---     ┆ ---     │
# │ i64   ┆ i64     ┆ i64     │
# ╞═══════╪═════════╪═════════╡
# │ 0     ┆ 0       ┆ 0       │
# │ 0     ┆ 1       ┆ 10      │
# │ 0     ┆ 2       ┆ 20      │
# │ 0     ┆ 3       ┆ 30      │
# │ …     ┆ …       ┆ …       │
# │ 3     ┆ 36      ┆ 360     │
# │ 3     ┆ 37      ┆ 370     │
# │ 3     ┆ 38      ┆ 380     │
# │ 3     ┆ 39      ┆ 390     │
# └───────┴─────────┴─────────┘

要为每个组和每个值创建单独的 headtail 列,其中每个 headtail 都是值的列表:

n = 3
(
    df
    .groupby("group", maintain_order=True)
    .agg(
        pl.col("*").head(n).suffix("_head"),
        pl.col("*").tail(n).suffix("_tail"),
    )
)
# shape: (4, 5)
# ┌───────┬──────────────┬─────────────────┬──────────────┬─────────────────┐
# │ group ┆ value_a_head ┆ value_b_head    ┆ value_a_tail ┆ value_b_tail    │
# │ ---   ┆ ---          ┆ ---             ┆ ---          ┆ ---             │
# │ i64   ┆ list[i64]    ┆ list[i64]       ┆ list[i64]    ┆ list[i64]       │
# ╞═══════╪══════════════╪═════════════════╪══════════════╪═════════════════╡
# │ 0     ┆ [0, 1, 2]    ┆ [0, 10, 20]     ┆ [7, 8, 9]    ┆ [70, 80, 90]    │
# │ 1     ┆ [10, 11, 12] ┆ [100, 110, 120] ┆ [17, 18, 19] ┆ [170, 180, 190] │
# │ 2     ┆ [20, 21, 22] ┆ [200, 210, 220] ┆ [27, 28, 29] ┆ [270, 280, 290] │
# │ 3     ┆ [30, 31, 32] ┆ [300, 310, 320] ┆ [37, 38, 39] ┆ [370, 380, 390] │
# └───────┴──────────────┴─────────────────┴──────────────┴─────────────────┘

要为每个组和每个值创建单独的 headtail 列,其中每个 headtail 都是 "un-listed"(这种格式有用吗?我不确定...):

n = 3
(
    df
    .groupby("group", maintain_order=True)
    .agg(
        pl.col("*").head(n).suffix("_head"),
        pl.col("*").tail(n).suffix("_tail"),
    )
    .explode(pl.exclude("group"))
)
# shape: (12, 5)
# ┌───────┬──────────────┬──────────────┬──────────────┬──────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ ---   ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
# │ i64   ┆ i64          ┆ i64          ┆ i64          ┆ i64          │
# ╞═══════╪══════════════╪══════════════╪══════════════╪══════════════╡
# │ 0     ┆ 0            ┆ 0            ┆ 7            ┆ 70           │
# │ 0     ┆ 1            ┆ 10           ┆ 8            ┆ 80           │
# │ 0     ┆ 2            ┆ 20           ┆ 9            ┆ 90           │
# │ 1     ┆ 10           ┆ 100          ┆ 17           ┆ 170          │
# │ …     ┆ …            ┆ …            ┆ …            ┆ …            │
# │ 2     ┆ 22           ┆ 220          ┆ 29           ┆ 290          │
# │ 3     ┆ 30           ┆ 300          ┆ 37           ┆ 370          │
# │ 3     ┆ 31           ┆ 310          ┆ 38           ┆ 380          │
# │ 3     ┆ 32           ┆ 320          ┆ 39           ┆ 390          │
# └──────

<details>
<summary>英文:</summary>

[@Wayoshi](https://stackoverflow.com/users/21191420/wayoshi) has already submitted an answer which you&#39;ve accepted, however, there&#39;s some ambiguity in the format which you&#39;d like the results to be in. So here I submit a couple of different solution forms to the same problem, inspired by @Wayoshi&#39;s example data, which might be useful to others

Creating example data:

```python
import numpy as np
import polars as pl

groups = 4
group_size = 10
df = pl.DataFrame(
    {
        &quot;group&quot;: np.repeat(range(groups), group_size),
        &quot;value_a&quot;: range(groups * group_size),
        &quot;value_b&quot;: range(0, groups * group_size ** 2, group_size),
    }
)
df
# shape: (40, 3)
# ┌───────┬─────────┬─────────┐
# │ group ┆ value_a ┆ value_b │
# │ ---   ┆ ---     ┆ ---     │
# │ i64   ┆ i64     ┆ i64     │
# ╞═══════╪═════════╪═════════╡
# │ 0     ┆ 0       ┆ 0       │
# │ 0     ┆ 1       ┆ 10      │
# │ 0     ┆ 2       ┆ 20      │
# │ 0     ┆ 3       ┆ 30      │
# │ …     ┆ …       ┆ …       │
# │ 3     ┆ 36      ┆ 360     │
# │ 3     ┆ 37      ┆ 370     │
# │ 3     ┆ 38      ┆ 380     │
# │ 3     ┆ 39      ┆ 390     │
# └───────┴─────────┴─────────┘

To create separate head and tail columns for each group, and each value, where each head and tail is a list of values:

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.col(&quot;*&quot;).head(n).suffix(&quot;_head&quot;),
        pl.col(&quot;*&quot;).tail(n).suffix(&quot;_tail&quot;),
    )
)
# shape: (4, 5)
# ┌───────┬──────────────┬─────────────────┬──────────────┬─────────────────┐
# │ group ┆ value_a_head ┆ value_b_head    ┆ value_a_tail ┆ value_b_tail    │
# │ ---   ┆ ---          ┆ ---             ┆ ---          ┆ ---             │
# │ i64   ┆ list[i64]    ┆ list[i64]       ┆ list[i64]    ┆ list[i64]       │
# ╞═══════╪══════════════╪═════════════════╪══════════════╪═════════════════╡
# │ 0     ┆ [0, 1, 2]    ┆ [0, 10, 20]     ┆ [7, 8, 9]    ┆ [70, 80, 90]    │
# │ 1     ┆ [10, 11, 12] ┆ [100, 110, 120] ┆ [17, 18, 19] ┆ [170, 180, 190] │
# │ 2     ┆ [20, 21, 22] ┆ [200, 210, 220] ┆ [27, 28, 29] ┆ [270, 280, 290] │
# │ 3     ┆ [30, 31, 32] ┆ [300, 310, 320] ┆ [37, 38, 39] ┆ [370, 380, 390] │
# └───────┴──────────────┴─────────────────┴──────────────┴─────────────────┘

To create separate head and tail columns for each group, and each value,
where each head and tail is "un-listed" (is this format useful? I'm not
sure...)

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.col(&quot;*&quot;).head(n).suffix(&quot;_head&quot;),
        pl.col(&quot;*&quot;).tail(n).suffix(&quot;_tail&quot;),
    )
    .explode(pl.exclude(&quot;group&quot;))
)
# shape: (12, 5)
# ┌───────┬──────────────┬──────────────┬──────────────┬──────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ ---   ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
# │ i64   ┆ i64          ┆ i64          ┆ i64          ┆ i64          │
# ╞═══════╪══════════════╪══════════════╪══════════════╪══════════════╡
# │ 0     ┆ 0            ┆ 0            ┆ 7            ┆ 70           │
# │ 0     ┆ 1            ┆ 10           ┆ 8            ┆ 80           │
# │ 0     ┆ 2            ┆ 20           ┆ 9            ┆ 90           │
# │ 1     ┆ 10           ┆ 100          ┆ 17           ┆ 170          │
# │ …     ┆ …            ┆ …            ┆ …            ┆ …            │
# │ 2     ┆ 22           ┆ 220          ┆ 29           ┆ 290          │
# │ 3     ┆ 30           ┆ 300          ┆ 37           ┆ 370          │
# │ 3     ┆ 31           ┆ 310          ┆ 38           ┆ 380          │
# │ 3     ┆ 32           ┆ 320          ┆ 39           ┆ 390          │
# └───────┴──────────────┴──────────────┴──────────────┴──────────────┘

To create a combined head_tail column for each group, where each head and tail is a list of values (again, is the format "useful"? probably not... but that's for someone else to decide):

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.col(&quot;*&quot;).head(n).append(pl.col(&quot;*&quot;).tail(n)).suffix(&quot;_head_tail&quot;),
     )
)
# shape: (4, 3)
# ┌───────┬───────────────────┬───────────────────┐
# │ group ┆ value_a_head_tail ┆ value_b_head_tail │
# │ ---   ┆ ---               ┆ ---               │
# │ i64   ┆ list[i64]         ┆ list[i64]         │
# ╞═══════╪═══════════════════╪═══════════════════╡
# │ 0     ┆ [0, 1, … 9]       ┆ [0, 10, … 90]     │
# │ 1     ┆ [10, 11, … 19]    ┆ [100, 110, … 190] │
# │ 2     ┆ [20, 21, … 29]    ┆ [200, 210, … 290] │
# │ 3     ┆ [30, 31, … 39]    ┆ [300, 310, … 390] │
# └───────┴───────────────────┴───────────────────┘

I'm not convinced that the operation would ever be useful, but who am I to say. If you wanted to head and tail all your value columns into a single column, stored as a list:

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.concat([pl.exclude(&quot;group&quot;).head(n), pl.exclude(&quot;group&quot;).tail(n)]).alias(&quot;all_head_tail&quot;)
    )
)
# shape: (4, 2)
# ┌───────┬─────────────────┐
# │ group ┆ all_head_tail   │
# │ ---   ┆ ---             │
# │ i64   ┆ list[i64]       │
# ╞═══════╪═════════════════╡
# │ 0     ┆ [0, 1, … 90]    │
# │ 1     ┆ [10, 11, … 190] │
# │ 2     ┆ [20, 21, … 290] │
# │ 3     ┆ [30, 31, … 390] │
# └───────┴─────────────────┘

And finally, if you wanted to head and tail all your value columns into a single
column, and unlist the values (the format which Wayoshi offered), an
alternative way to achieve this would be:

n = 3
(
    df
    .groupby(&quot;group&quot;, maintain_order=True)
    .agg(
        pl.concat([pl.exclude(&quot;group&quot;).head(n), pl.exclude(&quot;group&quot;).tail(n)]).alias(&quot;all_head_tail&quot;)
    )
    .explode(&quot;all_head_tail&quot;)
)
# shape: (48, 2)
# ┌───────┬───────────────┐
# │ group ┆ all_head_tail │
# │ ---   ┆ ---           │
# │ i64   ┆ i64           │
# ╞═══════╪═══════════════╡
# │ 0     ┆ 0             │
# │ 0     ┆ 1             │
# │ 0     ┆ 2             │
# │ 0     ┆ 0             │
# │ …     ┆ …             │
# │ 3     ┆ 39            │
# │ 3     ┆ 370           │
# │ 3     ┆ 380           │
# │ 3     ┆ 390           │
# └───────┴───────────────┘

huangapple
  • 本文由 发表于 2023年7月20日 11:38:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/76726524.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定