英文:
How to take head(n) and tail(n) in one groupby with polars
问题
这里是一个示例数据框,我想要在每一天中使用groupby("date").agg()获取前n行和后n行。我知道可以使用两个groupby来获取前部分和后部分,然后将它们连接起来。是否有一种在一个groupby或with_columns或filter中执行此操作的技巧?
英文:
Here is a sample dataframe, I want to get head(n) rows and tail(n) rows in everyday which is a groupby("date").agg() with polars. I know it could use two groupby to get head and tail part and then concat them. Is there some trick to do this in one groupby or with_columns or filter?
答案1
得分: 3
这是一个通用的方法,用于获取每个组的 head(n) 和 tail(n) 并将其合并到最终的DataFrame中,无需使用 concat 操作,示例中使用一个简单的DataFrame。
它使用 filter 来从每个组中移除中间的行,使用 cumcount 和 count 进行适当的计算,使用 over 在原地进行分组聚合:
N = 50
df = pl.DataFrame(
{
"group": [1] * N + [2] * N,
"value": range(1, 2 * N + 1),
}
)
n = 2
df.filter(
~pl.col('group')
.cumcount()
.over('group')
.is_between(n, pl.col('group').count().over('group') - n, closed='left')
)
shape: (8, 2)
┌───────┬───────┐
│ group │ value │
│ --- │ --- │
│ i64 │ i64 │
╞═══════╪═══════╡
│ 1 │ 1 │
│ 1 │ 2 │
│ 1 │ 49 │
│ 1 │ 50 │
│ 2 │ 51 │
│ 2 │ 52 │
│ 2 │ 99 │
│ 2 │ 100 │
└───────┴───────┘
您可以调整 n 的值,看到它适用于所有非负的 n。
英文:
Here's a general way to get the head(n) and tail(n) per group into a final DataFrame, without concat shenanigans, and using a trivial df as an example.
It uses filter to take out the middle rows over each group, with appropriate calculations being done with cumcount and count, with over doing the groupby-aggregations in place:
N = 50
df = pl.DataFrame(
{
"group": [1] * N + [2] * N,
"value": range(1, 2 * N + 1),
}
)
n = 2
df.filter(
~pl.col('group')
.cumcount()
.over('group')
.is_between(n, pl.col('group').count().over('group') - n, closed='left')
)
shape: (8, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 1 ┆ 1 │
│ 1 ┆ 2 │
│ 1 ┆ 49 │
│ 1 ┆ 50 │
│ 2 ┆ 51 │
│ 2 ┆ 52 │
│ 2 ┆ 99 │
│ 2 ┆ 100 │
└───────┴───────┘
You can adjust n and see that it works for all non-negative n.
答案2
得分: 1
@Wayoshi 已经提交了一个答案,你已经接受了,但是在你希望结果的格式方面存在一些不确定性。因此,在这里,我提交了几种不同的解决方案形式,与@Wayoshi的示例数据类似,这可能对其他人有用。
创建示例数据:
import numpy as np
import polars as pl
groups = 4
group_size = 10
df = pl.DataFrame(
{
"group": np.repeat(range(groups), group_size),
"value_a": range(groups * group_size),
"value_b": range(0, groups * group_size ** 2, group_size),
}
)
df
# shape: (40, 3)
# ┌───────┬─────────┬─────────┐
# │ group ┆ value_a ┆ value_b │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═══════╪═════════╪═════════╡
# │ 0 ┆ 0 ┆ 0 │
# │ 0 ┆ 1 ┆ 10 │
# │ 0 ┆ 2 ┆ 20 │
# │ 0 ┆ 3 ┆ 30 │
# │ … ┆ … ┆ … │
# │ 3 ┆ 36 ┆ 360 │
# │ 3 ┆ 37 ┆ 370 │
# │ 3 ┆ 38 ┆ 380 │
# │ 3 ┆ 39 ┆ 390 │
# └───────┴─────────┴─────────┘
要为每个组和每个值创建单独的 head 和 tail 列,其中每个 head 和 tail 都是值的列表:
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).suffix("_head"),
pl.col("*").tail(n).suffix("_tail"),
)
)
# shape: (4, 5)
# ┌───────┬──────────────┬─────────────────┬──────────────┬─────────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] │
# ╞═══════╪══════════════╪═════════════════╪══════════════╪═════════════════╡
# │ 0 ┆ [0, 1, 2] ┆ [0, 10, 20] ┆ [7, 8, 9] ┆ [70, 80, 90] │
# │ 1 ┆ [10, 11, 12] ┆ [100, 110, 120] ┆ [17, 18, 19] ┆ [170, 180, 190] │
# │ 2 ┆ [20, 21, 22] ┆ [200, 210, 220] ┆ [27, 28, 29] ┆ [270, 280, 290] │
# │ 3 ┆ [30, 31, 32] ┆ [300, 310, 320] ┆ [37, 38, 39] ┆ [370, 380, 390] │
# └───────┴──────────────┴─────────────────┴──────────────┴─────────────────┘
要为每个组和每个值创建单独的 head 和 tail 列,其中每个 head 和 tail 都是 "un-listed"(这种格式有用吗?我不确定...):
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).suffix("_head"),
pl.col("*").tail(n).suffix("_tail"),
)
.explode(pl.exclude("group"))
)
# shape: (12, 5)
# ┌───────┬──────────────┬──────────────┬──────────────┬──────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═══════╪══════════════╪══════════════╪══════════════╪══════════════╡
# │ 0 ┆ 0 ┆ 0 ┆ 7 ┆ 70 │
# │ 0 ┆ 1 ┆ 10 ┆ 8 ┆ 80 │
# │ 0 ┆ 2 ┆ 20 ┆ 9 ┆ 90 │
# │ 1 ┆ 10 ┆ 100 ┆ 17 ┆ 170 │
# │ … ┆ … ┆ … ┆ … ┆ … │
# │ 2 ┆ 22 ┆ 220 ┆ 29 ┆ 290 │
# │ 3 ┆ 30 ┆ 300 ┆ 37 ┆ 370 │
# │ 3 ┆ 31 ┆ 310 ┆ 38 ┆ 380 │
# │ 3 ┆ 32 ┆ 320 ┆ 39 ┆ 390 │
# └──────
<details>
<summary>英文:</summary>
[@Wayoshi](https://stackoverflow.com/users/21191420/wayoshi) has already submitted an answer which you've accepted, however, there's some ambiguity in the format which you'd like the results to be in. So here I submit a couple of different solution forms to the same problem, inspired by @Wayoshi's example data, which might be useful to others
Creating example data:
```python
import numpy as np
import polars as pl
groups = 4
group_size = 10
df = pl.DataFrame(
{
"group": np.repeat(range(groups), group_size),
"value_a": range(groups * group_size),
"value_b": range(0, groups * group_size ** 2, group_size),
}
)
df
# shape: (40, 3)
# ┌───────┬─────────┬─────────┐
# │ group ┆ value_a ┆ value_b │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═══════╪═════════╪═════════╡
# │ 0 ┆ 0 ┆ 0 │
# │ 0 ┆ 1 ┆ 10 │
# │ 0 ┆ 2 ┆ 20 │
# │ 0 ┆ 3 ┆ 30 │
# │ … ┆ … ┆ … │
# │ 3 ┆ 36 ┆ 360 │
# │ 3 ┆ 37 ┆ 370 │
# │ 3 ┆ 38 ┆ 380 │
# │ 3 ┆ 39 ┆ 390 │
# └───────┴─────────┴─────────┘
To create separate head and tail columns for each group, and each value, where each head and tail is a list of values:
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).suffix("_head"),
pl.col("*").tail(n).suffix("_tail"),
)
)
# shape: (4, 5)
# ┌───────┬──────────────┬─────────────────┬──────────────┬─────────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] │
# ╞═══════╪══════════════╪═════════════════╪══════════════╪═════════════════╡
# │ 0 ┆ [0, 1, 2] ┆ [0, 10, 20] ┆ [7, 8, 9] ┆ [70, 80, 90] │
# │ 1 ┆ [10, 11, 12] ┆ [100, 110, 120] ┆ [17, 18, 19] ┆ [170, 180, 190] │
# │ 2 ┆ [20, 21, 22] ┆ [200, 210, 220] ┆ [27, 28, 29] ┆ [270, 280, 290] │
# │ 3 ┆ [30, 31, 32] ┆ [300, 310, 320] ┆ [37, 38, 39] ┆ [370, 380, 390] │
# └───────┴──────────────┴─────────────────┴──────────────┴─────────────────┘
To create separate head and tail columns for each group, and each value,
where each head and tail is "un-listed" (is this format useful? I'm not
sure...)
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).suffix("_head"),
pl.col("*").tail(n).suffix("_tail"),
)
.explode(pl.exclude("group"))
)
# shape: (12, 5)
# ┌───────┬──────────────┬──────────────┬──────────────┬──────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═══════╪══════════════╪══════════════╪══════════════╪══════════════╡
# │ 0 ┆ 0 ┆ 0 ┆ 7 ┆ 70 │
# │ 0 ┆ 1 ┆ 10 ┆ 8 ┆ 80 │
# │ 0 ┆ 2 ┆ 20 ┆ 9 ┆ 90 │
# │ 1 ┆ 10 ┆ 100 ┆ 17 ┆ 170 │
# │ … ┆ … ┆ … ┆ … ┆ … │
# │ 2 ┆ 22 ┆ 220 ┆ 29 ┆ 290 │
# │ 3 ┆ 30 ┆ 300 ┆ 37 ┆ 370 │
# │ 3 ┆ 31 ┆ 310 ┆ 38 ┆ 380 │
# │ 3 ┆ 32 ┆ 320 ┆ 39 ┆ 390 │
# └───────┴──────────────┴──────────────┴──────────────┴──────────────┘
To create a combined head_tail column for each group, where each head and tail is a list of values (again, is the format "useful"? probably not... but that's for someone else to decide):
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).append(pl.col("*").tail(n)).suffix("_head_tail"),
)
)
# shape: (4, 3)
# ┌───────┬───────────────────┬───────────────────┐
# │ group ┆ value_a_head_tail ┆ value_b_head_tail │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ list[i64] ┆ list[i64] │
# ╞═══════╪═══════════════════╪═══════════════════╡
# │ 0 ┆ [0, 1, … 9] ┆ [0, 10, … 90] │
# │ 1 ┆ [10, 11, … 19] ┆ [100, 110, … 190] │
# │ 2 ┆ [20, 21, … 29] ┆ [200, 210, … 290] │
# │ 3 ┆ [30, 31, … 39] ┆ [300, 310, … 390] │
# └───────┴───────────────────┴───────────────────┘
I'm not convinced that the operation would ever be useful, but who am I to say. If you wanted to head and tail all your value columns into a single column, stored as a list:
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.concat([pl.exclude("group").head(n), pl.exclude("group").tail(n)]).alias("all_head_tail")
)
)
# shape: (4, 2)
# ┌───────┬─────────────────┐
# │ group ┆ all_head_tail │
# │ --- ┆ --- │
# │ i64 ┆ list[i64] │
# ╞═══════╪═════════════════╡
# │ 0 ┆ [0, 1, … 90] │
# │ 1 ┆ [10, 11, … 190] │
# │ 2 ┆ [20, 21, … 290] │
# │ 3 ┆ [30, 31, … 390] │
# └───────┴─────────────────┘
And finally, if you wanted to head and tail all your value columns into a single
column, and unlist the values (the format which Wayoshi offered), an
alternative way to achieve this would be:
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.concat([pl.exclude("group").head(n), pl.exclude("group").tail(n)]).alias("all_head_tail")
)
.explode("all_head_tail")
)
# shape: (48, 2)
# ┌───────┬───────────────┐
# │ group ┆ all_head_tail │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═══════╪═══════════════╡
# │ 0 ┆ 0 │
# │ 0 ┆ 1 │
# │ 0 ┆ 2 │
# │ 0 ┆ 0 │
# │ … ┆ … │
# │ 3 ┆ 39 │
# │ 3 ┆ 370 │
# │ 3 ┆ 380 │
# │ 3 ┆ 390 │
# └───────┴───────────────┘
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。



评论