Polars: Pad list columns to specific size

Question

I think I ran into the XY problem...

Here is what I actually want to do:

To be exact, I have a dataframe like:

shape: (3, 3)
┌───────────┬───────┬──────────────────────────┐
│ nrs       ┆ stuff ┆ more_stuff               │
│ ---       ┆ ---   ┆ ---                      │
│ list[i64] ┆ i64   ┆ list[list[i64]]          │
╞═══════════╪═══════╪══════════════════════════╡
│ [1, 2, 3] ┆ 1     ┆ [[1, 1], [2, 2], [3, 3]] │
│ [2, 4]    ┆ 2     ┆ [[4, 4], [5, 5]]         │
│ [1]       ┆ 3     ┆ [[6, 6]]                 │
└───────────┴───────┴──────────────────────────┘

It has normal int64 columns, list[int64] columns and one list[list[int64]] column.
I want to be able to specify a size and set the length of every list column (including the nested one) to that size, either by truncating to that size or by padding with the list's last value (list[-1] for normal Python lists). This applies to both the nested and the flat lists. The non-list columns should be left unchanged.

So the result for N=2 for the above dataframe should be:

shape: (3, 3)
┌───────────┬───────┬──────────────────┐
│ nrs       ┆ stuff ┆ more_stuff       │
│ ---       ┆ ---   ┆ ---              │
│ list[i64] ┆ i64   ┆ list[list[i64]]  │
╞═══════════╪═══════╪══════════════════╡
│ [1, 2]    ┆ 1     ┆ [[1, 1], [2, 2]] │
│ [2, 4]    ┆ 2     ┆ [[4, 4], [5, 5]] │
│ [1, 1]    ┆ 3     ┆ [[6, 6], [6, 6]] │
└───────────┴───────┴──────────────────┘
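
For reference, the rule described above can be written in plain Python. This is only a sketch of the intended semantics, not Polars code: `fit_to_length` is a hypothetical helper name, and padding an empty list with None is an assumption, since the question does not say what should happen when there is no last value.

```python
from typing import Any, List


def fit_to_length(values: List[Any], n: int) -> List[Any]:
    """Truncate `values` to n items, or pad it by repeating its last item."""
    if len(values) >= n:
        return values[:n]
    # Assumption: an empty list has no last item, so it is padded with None.
    filler = values[-1] if values else None
    return values + [filler] * (n - len(values))


nrs = [[1, 2, 3], [2, 4], [1]]
more_stuff = [[[1, 1], [2, 2], [3, 3]], [[4, 4], [5, 5]], [[6, 6]]]

print([fit_to_length(row, 2) for row in nrs])
# [[1, 2], [2, 4], [1, 1]]
print([fit_to_length(row, 2) for row in more_stuff])
# [[[1, 1], [2, 2]], [[4, 4], [5, 5]], [[6, 6], [6, 6]]]
```

The printed lists match the expected N=2 table, so the helper can serve as a check for any Polars-based solution.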

Answer 1

Score: 1

Answer via Reddit from /u/commandlineluser:

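import polars as pl

# Note: written for the Polars release current in May 2023, where list
# methods lived under `.arr`; later releases use the `.list` namespace.
df = pl.DataFrame({
    "nrs": [[1, 2, 3], [2, 4], [1]],
    "stuff": [1, 2, 3],
    "more_stuff": [[[1, 1], [2, 2], [3, 3]], [[4, 4], [5, 5]], [[6, 6]]],
})

# Only the list columns are resized; "stuff" is left untouched.
cols = "nrs", "more_stuff"

df.with_columns(
    # Gather indices 0..max_len-1 from every list; indices past the end
    # of a shorter list become null instead of raising.
    pl.col(cols).arr.take(
        pl.arange(0, pl.col(cols).arr.lengths().max()),
        null_on_oob=True,
    # Replace each null with the last preceding non-null value.
    ).arr.eval(pl.element().forward_fill())
)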

# shape: (3, 3)
# ┌───────────┬───────┬──────────────────────────┐
# │ nrs       ┆ stuff ┆ more_stuff               │
# │ ---       ┆ ---   ┆ ---                      │
# │ list[i64] ┆ i64   ┆ list[list[i64]]          │
# ╞═══════════╪═══════╪══════════════════════════╡
# │ [1, 2, 3] ┆ 1     ┆ [[1, 1], [2, 2], [3, 3]] │
# │ [2, 4, 4] ┆ 2     ┆ [[4, 4], [5, 5], [5, 5]] │
# │ [1, 1, 1] ┆ 3     ┆ [[6, 6], [6, 6], [6, 6]] │
# └───────────┴───────┴──────────────────────────┘

This pads every list to the length of the longest one. To get a smaller length, add a .slice()/.head() afterwards to trim the lists.

EDIT:

Possibly worth noting that .forward_fill() doesn't specifically target the padded nulls, so that could be an issue if there were nulls in the initial data. It's possible to handle this, but requires a bit more code.
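
The caveat can be illustrated without Polars. The following plain-Python sketch mimics the pad-then-forward-fill approach and compares it with padding by the last element only; the helper names `pad_with_none`, `forward_fill` and `pad_with_last` are hypothetical and exist only for this illustration.

```python
from typing import Any, List


def pad_with_none(values: List[Any], n: int) -> List[Any]:
    """Append None until the list has n items (mimics null padding)."""
    return values + [None] * max(0, n - len(values))


def forward_fill(values: List[Any]) -> List[Any]:
    """Replace every None with the last preceding non-None value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def pad_with_last(values: List[Any], n: int) -> List[Any]:
    """Append copies of the last item only; existing None values stay."""
    filler = values[-1] if values else None
    return values + [filler] * max(0, n - len(values))


original = [1, None, 3]  # the None at index 1 is real data

print(forward_fill(pad_with_none(original, 5)))
# [1, 1, 3, 3, 3]    <- the original null was overwritten as well
print(pad_with_last(original, 5))
# [1, None, 3, 3, 3] <- only the padded positions were filled
```

If the input lists never contain nulls, both approaches give the same result.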

Answer 2

Score: 0

See extend_constant.

You can shorten the lists with head / tail.

huangapple
  • Published on 2023-05-29 01:51:47
  • When republishing, please keep the link to this article: https://go.coder-hub.com/76352834.html