英文:
How to take head(n) and tail(n) in one groupby with polars
问题
这里是一个示例数据框,我想要在每一天中使用groupby("date").agg()
获取前n行和后n行。我知道可以使用两个groupby
来获取前部分和后部分,然后将它们连接起来。是否有一种在一个groupby
或with_columns
或filter
中执行此操作的技巧?
英文:
Here is a sample dataframe, I want to get head(n) rows and tail(n) rows in everyday which is a groupby("date").agg()
with polars. I know it could use two groupby to get head and tail part and then concat them. Is there some trick to do this in one groupby or with_columns or filter?
答案1
得分: 3
这是一个通用的方法,用于获取每个组的 head(n)
和 tail(n)
并将其合并到最终的DataFrame中,无需使用 concat
操作,示例中使用一个简单的DataFrame。
它使用 filter
来从每个组中移除中间的行,使用 cumcount
和 count
进行适当的计算,使用 over
在原地进行分组聚合:
N = 50
df = pl.DataFrame(
{
"group": [1] * N + [2] * N,
"value": range(1, 2 * N + 1),
}
)
n = 2
df.filter(
~pl.col('group')
.cumcount()
.over('group')
.is_between(n, pl.col('group').count().over('group') - n, closed='left')
)
shape: (8, 2)
┌───────┬───────┐
│ group │ value │
│ --- │ --- │
│ i64 │ i64 │
╞═══════╪═══════╡
│ 1 │ 1 │
│ 1 │ 2 │
│ 1 │ 49 │
│ 1 │ 50 │
│ 2 │ 51 │
│ 2 │ 52 │
│ 2 │ 99 │
│ 2 │ 100 │
└───────┴───────┘
您可以调整 n
的值,看到它适用于所有非负的 n
。
英文:
Here's a general way to get the head(n)
and tail(n)
per group into a final DataFrame, without concat
shenanigans, and using a trivial df as an example.
It uses filter
to take out the middle rows over each group, with appropriate calculations being done with cumcount
and count
, with over
doing the groupby-aggregations in place:
N = 50
df = pl.DataFrame(
{
"group": [1] * N + [2] * N,
"value": range(1, 2 * N + 1),
}
)
n = 2
df.filter(
~pl.col('group')
.cumcount()
.over('group')
.is_between(n, pl.col('group').count().over('group') - n, closed='left')
)
shape: (8, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 1 ┆ 1 │
│ 1 ┆ 2 │
│ 1 ┆ 49 │
│ 1 ┆ 50 │
│ 2 ┆ 51 │
│ 2 ┆ 52 │
│ 2 ┆ 99 │
│ 2 ┆ 100 │
└───────┴───────┘
You can adjust n
and see that it works for all non-negative n
.
答案2
得分: 1
@Wayoshi 已经提交了一个答案,你已经接受了,但是在你希望结果的格式方面存在一些不确定性。因此,在这里,我提交了几种不同的解决方案形式,与@Wayoshi的示例数据类似,这可能对其他人有用。
创建示例数据:
import numpy as np
import polars as pl
groups = 4
group_size = 10
df = pl.DataFrame(
{
"group": np.repeat(range(groups), group_size),
"value_a": range(groups * group_size),
"value_b": range(0, groups * group_size ** 2, group_size),
}
)
df
# shape: (40, 3)
# ┌───────┬─────────┬─────────┐
# │ group ┆ value_a ┆ value_b │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═══════╪═════════╪═════════╡
# │ 0 ┆ 0 ┆ 0 │
# │ 0 ┆ 1 ┆ 10 │
# │ 0 ┆ 2 ┆ 20 │
# │ 0 ┆ 3 ┆ 30 │
# │ … ┆ … ┆ … │
# │ 3 ┆ 36 ┆ 360 │
# │ 3 ┆ 37 ┆ 370 │
# │ 3 ┆ 38 ┆ 380 │
# │ 3 ┆ 39 ┆ 390 │
# └───────┴─────────┴─────────┘
要为每个组和每个值创建单独的 head
和 tail
列,其中每个 head
和 tail
都是值的列表:
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).suffix("_head"),
pl.col("*").tail(n).suffix("_tail"),
)
)
# shape: (4, 5)
# ┌───────┬──────────────┬─────────────────┬──────────────┬─────────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] │
# ╞═══════╪══════════════╪═════════════════╪══════════════╪═════════════════╡
# │ 0 ┆ [0, 1, 2] ┆ [0, 10, 20] ┆ [7, 8, 9] ┆ [70, 80, 90] │
# │ 1 ┆ [10, 11, 12] ┆ [100, 110, 120] ┆ [17, 18, 19] ┆ [170, 180, 190] │
# │ 2 ┆ [20, 21, 22] ┆ [200, 210, 220] ┆ [27, 28, 29] ┆ [270, 280, 290] │
# │ 3 ┆ [30, 31, 32] ┆ [300, 310, 320] ┆ [37, 38, 39] ┆ [370, 380, 390] │
# └───────┴──────────────┴─────────────────┴──────────────┴─────────────────┘
要为每个组和每个值创建单独的 head
和 tail
列,其中每个 head
和 tail
都是 "un-listed"(这种格式有用吗?我不确定...):
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).suffix("_head"),
pl.col("*").tail(n).suffix("_tail"),
)
.explode(pl.exclude("group"))
)
# shape: (12, 5)
# ┌───────┬──────────────┬──────────────┬──────────────┬──────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═══════╪══════════════╪══════════════╪══════════════╪══════════════╡
# │ 0 ┆ 0 ┆ 0 ┆ 7 ┆ 70 │
# │ 0 ┆ 1 ┆ 10 ┆ 8 ┆ 80 │
# │ 0 ┆ 2 ┆ 20 ┆ 9 ┆ 90 │
# │ 1 ┆ 10 ┆ 100 ┆ 17 ┆ 170 │
# │ … ┆ … ┆ … ┆ … ┆ … │
# │ 2 ┆ 22 ┆ 220 ┆ 29 ┆ 290 │
# │ 3 ┆ 30 ┆ 300 ┆ 37 ┆ 370 │
# │ 3 ┆ 31 ┆ 310 ┆ 38 ┆ 380 │
# │ 3 ┆ 32 ┆ 320 ┆ 39 ┆ 390 │
# └──────
<details>
<summary>英文:</summary>
[@Wayoshi](https://stackoverflow.com/users/21191420/wayoshi) has already submitted an answer which you've accepted, however, there's some ambiguity in the format which you'd like the results to be in. So here I submit a couple of different solution forms to the same problem, inspired by @Wayoshi's example data, which might be useful to others
Creating example data:
```python
import numpy as np
import polars as pl
groups = 4
group_size = 10
df = pl.DataFrame(
{
"group": np.repeat(range(groups), group_size),
"value_a": range(groups * group_size),
"value_b": range(0, groups * group_size ** 2, group_size),
}
)
df
# shape: (40, 3)
# ┌───────┬─────────┬─────────┐
# │ group ┆ value_a ┆ value_b │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═══════╪═════════╪═════════╡
# │ 0 ┆ 0 ┆ 0 │
# │ 0 ┆ 1 ┆ 10 │
# │ 0 ┆ 2 ┆ 20 │
# │ 0 ┆ 3 ┆ 30 │
# │ … ┆ … ┆ … │
# │ 3 ┆ 36 ┆ 360 │
# │ 3 ┆ 37 ┆ 370 │
# │ 3 ┆ 38 ┆ 380 │
# │ 3 ┆ 39 ┆ 390 │
# └───────┴─────────┴─────────┘
To create separate head
and tail
columns for each group, and each value, where each head and tail is a list of values:
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).suffix("_head"),
pl.col("*").tail(n).suffix("_tail"),
)
)
# shape: (4, 5)
# ┌───────┬──────────────┬─────────────────┬──────────────┬─────────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] │
# ╞═══════╪══════════════╪═════════════════╪══════════════╪═════════════════╡
# │ 0 ┆ [0, 1, 2] ┆ [0, 10, 20] ┆ [7, 8, 9] ┆ [70, 80, 90] │
# │ 1 ┆ [10, 11, 12] ┆ [100, 110, 120] ┆ [17, 18, 19] ┆ [170, 180, 190] │
# │ 2 ┆ [20, 21, 22] ┆ [200, 210, 220] ┆ [27, 28, 29] ┆ [270, 280, 290] │
# │ 3 ┆ [30, 31, 32] ┆ [300, 310, 320] ┆ [37, 38, 39] ┆ [370, 380, 390] │
# └───────┴──────────────┴─────────────────┴──────────────┴─────────────────┘
To create separate head
and tail
columns for each group, and each value,
where each head and tail is "un-listed" (is this format useful? I'm not
sure...)
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).suffix("_head"),
pl.col("*").tail(n).suffix("_tail"),
)
.explode(pl.exclude("group"))
)
# shape: (12, 5)
# ┌───────┬──────────────┬──────────────┬──────────────┬──────────────┐
# │ group ┆ value_a_head ┆ value_b_head ┆ value_a_tail ┆ value_b_tail │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═══════╪══════════════╪══════════════╪══════════════╪══════════════╡
# │ 0 ┆ 0 ┆ 0 ┆ 7 ┆ 70 │
# │ 0 ┆ 1 ┆ 10 ┆ 8 ┆ 80 │
# │ 0 ┆ 2 ┆ 20 ┆ 9 ┆ 90 │
# │ 1 ┆ 10 ┆ 100 ┆ 17 ┆ 170 │
# │ … ┆ … ┆ … ┆ … ┆ … │
# │ 2 ┆ 22 ┆ 220 ┆ 29 ┆ 290 │
# │ 3 ┆ 30 ┆ 300 ┆ 37 ┆ 370 │
# │ 3 ┆ 31 ┆ 310 ┆ 38 ┆ 380 │
# │ 3 ┆ 32 ┆ 320 ┆ 39 ┆ 390 │
# └───────┴──────────────┴──────────────┴──────────────┴──────────────┘
To create a combined head_tail
column for each group, where each head and tail is a list of values (again, is the format "useful"? probably not... but that's for someone else to decide):
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.col("*").head(n).append(pl.col("*").tail(n)).suffix("_head_tail"),
)
)
# shape: (4, 3)
# ┌───────┬───────────────────┬───────────────────┐
# │ group ┆ value_a_head_tail ┆ value_b_head_tail │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ list[i64] ┆ list[i64] │
# ╞═══════╪═══════════════════╪═══════════════════╡
# │ 0 ┆ [0, 1, … 9] ┆ [0, 10, … 90] │
# │ 1 ┆ [10, 11, … 19] ┆ [100, 110, … 190] │
# │ 2 ┆ [20, 21, … 29] ┆ [200, 210, … 290] │
# │ 3 ┆ [30, 31, … 39] ┆ [300, 310, … 390] │
# └───────┴───────────────────┴───────────────────┘
I'm not convinced that the operation would ever be useful, but who am I to say. If you wanted to head and tail all your value columns into a single column, stored as a list:
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.concat([pl.exclude("group").head(n), pl.exclude("group").tail(n)]).alias("all_head_tail")
)
)
# shape: (4, 2)
# ┌───────┬─────────────────┐
# │ group ┆ all_head_tail │
# │ --- ┆ --- │
# │ i64 ┆ list[i64] │
# ╞═══════╪═════════════════╡
# │ 0 ┆ [0, 1, … 90] │
# │ 1 ┆ [10, 11, … 190] │
# │ 2 ┆ [20, 21, … 290] │
# │ 3 ┆ [30, 31, … 390] │
# └───────┴─────────────────┘
And finally, if you wanted to head and tail all your value columns into a single
column, and unlist the values (the format which Wayoshi offered), an
alternative way to achieve this would be:
n = 3
(
df
.groupby("group", maintain_order=True)
.agg(
pl.concat([pl.exclude("group").head(n), pl.exclude("group").tail(n)]).alias("all_head_tail")
)
.explode("all_head_tail")
)
# shape: (48, 2)
# ┌───────┬───────────────┐
# │ group ┆ all_head_tail │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═══════╪═══════════════╡
# │ 0 ┆ 0 │
# │ 0 ┆ 1 │
# │ 0 ┆ 2 │
# │ 0 ┆ 0 │
# │ … ┆ … │
# │ 3 ┆ 39 │
# │ 3 ┆ 370 │
# │ 3 ┆ 380 │
# │ 3 ┆ 390 │
# └───────┴───────────────┘
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论