在Python Polars数据框中对日期时间进行排序。

huangapple go评论74阅读模式
英文:

Sort list of datetime in Python polars dataframe

问题

I have problems working with datetime

我的问题出在处理日期时间数据上。

my dataset is like this:

我的数据集如下所示:

And I wish to sort the dates inside the list on every row, and get the first and second datetime, because some rows has more than 2 values. But I can't do it on Polars.

我希望对每一行中的日期进行排序,并获取第一个和第二个日期,因为有些行包含多于2个值。但我无法在Polars上实现这个。

I'm trying with :

我正在尝试使用:

filter_df.filter(
    pl.col("order_purchase_timestamp").arr.sort()
    )

but I got this error:

但我收到了以下错误信息:

ComputeError: Filter predicate must be of type Boolean, got: SchemaMisMatch(Owned("Series of dtype: List(Datetime(Microseconds, None)) != Boolean"))

英文:

I have problems working with datetime

my dataset is like this:

在Python Polars数据框中对日期时间进行排序。

And I wish to sort the dates inside the list on every row, and get the first and second datetime, because some rows has more than 2 values. But I can't do it on Polars.

I'm trying with :

filter_df.filter(
    pl.col("order_purchase_timestamp").arr.sort()
    )

but I got this error:
> ComputeError: Filter predicate must be of type Boolean, got: SchemaMisMatch(Owned("Series of dtype: List(Datetime(Microseconds, None)) != Boolean"))

答案1

得分: 3

你可以使用.with_columns()上下文来选择和修改列。

col = pl.col("order_purchase_timestamp")

df.with_columns(
    col.arr.sort(reverse=True).arr.slice(0, 2)
)   #        /                      /
    #      按日期排序              取片段 [0:2]

从评论讨论中得出的解决方案

要根据条件在列下执行某些操作,您可以使用pl.when -> then -> otherwise结构。pl.when()接受某个布尔Series(条件)。了解更多信息这里

col = pl.col("order_purchase_timestamp")

# 2个日期之间的持续时间(微秒)
diff = pl.duration(microseconds=(col.arr.get(0) - col.arr.get(1)))

df.filter(col.arr.lengths() >= 2).with_columns(
    col.arr.sort(reverse=True).arr.slice(0, 2)
).with_columns(
    less_30 = pl.when(diff <= pl.duration(days=30))\
                .then(True).otherwise(False)
)
英文:

You can use .with_columns() context to select & modify columns.

col = pl.col(&quot;order_purchase_timestamp&quot;)

df.with_columns(
    col.arr.sort(reverse=True).arr.slice(0, 2)
)   #        /                      /
    #      sort by date            take slice [0:2]

Solution that follows from the discussion in the comments

To perform some action under column based on condition, you can use pl.when -&gt; then -&gt; otherwise construct. pl.when() takes some boolean Series (condition). Read more here.

col = pl.col(&quot;order_purchase_timestamp&quot;)

# duration (difference) between 2 dates (in microseconds)
diff = pl.duration(microseconds=(col.arr.get(0) - col.arr.get(1)))

df.filter(col.arr.lengths() &gt;= 2).with_columns(
    col.arr.sort(reverse=True).arr.slice(0, 2)
).with_columns(
    less_30 = pl.when(diff &lt;= pl.duration(days=30))\
                .then(True).otherwise(False)
)

答案2

得分: 0

以下是代码部分的翻译:

  1. Instead of operating on lists - you could .explode() the column.
    你可以使用.explode()列代替操作列表。

  2. .with_row_count() can be used as a "group id":
    .with_row_count()可以用作“group id”(分组标识):

  3. You could then filter based on the group count ("size"):
    之后,你可以基于分组计数进行筛选("size"):

  4. .groupby().agg() can be used to recreate the lists.
    .groupby().agg()可用于重新创建列表。

  5. .head / .limit to keep only the first N results.
    使用.head / .limit来保留前N个结果。

英文:

Instead of operating on lists - you could .explode() the column.

df = pl.DataFrame({&quot;letters&quot;: [[&quot;f&quot;, &quot;a&quot;, &quot;m&quot;], [&quot;b&quot;], [&quot;e&quot;, &quot;d&quot;, &quot;c&quot;]]})
shape: (3, 1)
┌─────────────────┐
│ letters         │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ [&quot;f&quot;, &quot;a&quot;, &quot;m&quot;] │
│ [&quot;b&quot;]           │
│ [&quot;e&quot;, &quot;d&quot;, &quot;c&quot;] │
└─────────────────┘

.with_row_count() can be used as a "group id":

df.with_row_count(&quot;group&quot;).explode(&quot;letters&quot;).sort(&quot;letters&quot;)
shape: (7, 2)
┌───────┬─────────┐
│ group | letters │
│ ---   | ---     │
│ u32   | str     │
╞═══════╪═════════╡
│ 0     | a       │
│ 1     | b       │
│ 2     | c       │
│ 2     | d       │
│ 2     | e       │
│ 0     | f       │
│ 0     | m       │
└───────┴─────────┘

You could then filter based on the group count ("size"):

(df.with_row_count(&quot;group&quot;)
   .explode(&quot;letters&quot;)
   .sort(&quot;letters&quot;)
   .filter(pl.count().over(&quot;group&quot;) &gt; 1))
shape: (6, 2)
┌───────┬─────────┐
│ group | letters │
│ ---   | ---     │
│ u32   | str     │
╞═══════╪═════════╡
│ 0     | a       │
│ 2     | c       │
│ 2     | d       │
│ 2     | e       │
│ 0     | f       │
│ 0     | m       │
└───────┴─────────┘

.groupby().agg() can be used to recreate the lists.

.head / .limit to keep only the first N results.

(df.with_row_count(&quot;group&quot;)
   .explode(&quot;letters&quot;)
   .sort(&quot;letters&quot;)
   .filter(pl.count().over(&quot;group&quot;) &gt; 1)
   .groupby(&quot;group&quot;)
   .agg(pl.col(&quot;letters&quot;).head(2)))
shape: (2, 2)
┌───────┬────────────┐
│ group | letters    │
│ ---   | ---        │
│ u32   | list[str]  │
╞═══════╪════════════╡
│ 0     | [&quot;a&quot;, &quot;f&quot;] │
│ 2     | [&quot;c&quot;, &quot;d&quot;] │
└───────┴────────────┘

huangapple
  • 本文由 发表于 2023年3月1日 08:51:15
  • 转载请务必保留本文链接:https://go.coder-hub.com/75598678.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定