2023年3月1日 08:51:15go评论115阅读模式

英文:

Sort list of datetime in Python polars dataframe

问题

I have problems working with datetime

我的问题出在处理日期时间数据上。

my dataset is like this:

我的数据集如下所示：

And I wish to sort the dates inside the list on every row, and get the first and second datetime, because some rows has more than 2 values. But I can't do it on Polars.

我希望对每一行中的日期进行排序，并获取第一个和第二个日期，因为有些行包含多于2个值。但我无法在Polars上实现这个。

I'm trying with :

我正在尝试使用：

filter_df.filter(
    pl.col(&quot;order_purchase_timestamp&quot;).arr.sort()
    )

but I got this error:

但我收到了以下错误信息：

ComputeError: Filter predicate must be of type Boolean, got: SchemaMisMatch(Owned("Series of dtype: List(Datetime(Microseconds, None)) != Boolean"))

英文:

I have problems working with datetime

my dataset is like this:

And I wish to sort the dates inside the list on every row, and get the first and second datetime, because some rows has more than 2 values. But I can't do it on Polars.

I'm trying with :

filter_df.filter(
    pl.col(&quot;order_purchase_timestamp&quot;).arr.sort()
    )

but I got this error:
> ComputeError: Filter predicate must be of type Boolean, got: SchemaMisMatch(Owned("Series of dtype: List(Datetime(Microseconds, None)) != Boolean"))

答案1

得分: 3

你可以使用.with_columns()上下文来选择和修改列。

col = pl.col("order_purchase_timestamp")
df.with_columns(
    col.arr.sort(reverse=True).arr.slice(0, 2)
)   #        /                      /
    #      按日期排序              取片段 [0:2]

从评论讨论中得出的解决方案

要根据条件在列下执行某些操作，您可以使用pl.when -> then -> otherwise结构。pl.when()接受某个布尔Series（条件）。了解更多信息这里。

col = pl.col("order_purchase_timestamp")
# 2个日期之间的持续时间（微秒）
diff = pl.duration(microseconds=(col.arr.get(0) - col.arr.get(1)))
df.filter(col.arr.lengths() >= 2).with_columns(
    col.arr.sort(reverse=True).arr.slice(0, 2)
).with_columns(
    less_30 = pl.when(diff <= pl.duration(days=30))\
                .then(True).otherwise(False)
)

英文:

You can use .with_columns() context to select & modify columns.

col = pl.col(&quot;order_purchase_timestamp&quot;)
df.with_columns(
    col.arr.sort(reverse=True).arr.slice(0, 2)
)   #        /                      /
    #      sort by date            take slice [0:2]

Solution that follows from the discussion in the comments

To perform some action under column based on condition, you can use pl.when -> then -> otherwise construct. pl.when() takes some boolean Series (condition). Read more here.

col = pl.col(&quot;order_purchase_timestamp&quot;)
# duration (difference) between 2 dates (in microseconds)
diff = pl.duration(microseconds=(col.arr.get(0) - col.arr.get(1)))
df.filter(col.arr.lengths() &gt;= 2).with_columns(
    col.arr.sort(reverse=True).arr.slice(0, 2)
).with_columns(
    less_30 = pl.when(diff &lt;= pl.duration(days=30))\
                .then(True).otherwise(False)
)

答案2

得分: 0

以下是代码部分的翻译：

Instead of operating on lists - you could .explode() the column.
你可以使用.explode()列代替操作列表。
.with_row_count() can be used as a "group id":
.with_row_count()可以用作“group id”（分组标识）：
You could then filter based on the group count ("size"):
之后，你可以基于分组计数进行筛选（"size"）：
.groupby().agg() can be used to recreate the lists.
.groupby().agg()可用于重新创建列表。
.head / .limit to keep only the first N results.
使用.head / .limit来保留前N个结果。

英文:

Instead of operating on lists - you could .explode() the column.

df = pl.DataFrame({&quot;letters&quot;: [[&quot;f&quot;, &quot;a&quot;, &quot;m&quot;], [&quot;b&quot;], [&quot;e&quot;, &quot;d&quot;, &quot;c&quot;]]})

shape: (3, 1)
┌─────────────────┐
│ letters         │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ [&quot;f&quot;, &quot;a&quot;, &quot;m&quot;] │
│ [&quot;b&quot;]           │
│ [&quot;e&quot;, &quot;d&quot;, &quot;c&quot;] │
└─────────────────┘

.with_row_count() can be used as a "group id":

df.with_row_count(&quot;group&quot;).explode(&quot;letters&quot;).sort(&quot;letters&quot;)

shape: (7, 2)
┌───────┬─────────┐
│ group | letters │
│ ---   | ---     │
│ u32   | str     │
╞═══════╪═════════╡
│ 0     | a       │
│ 1     | b       │
│ 2     | c       │
│ 2     | d       │
│ 2     | e       │
│ 0     | f       │
│ 0     | m       │
└───────┴─────────┘

You could then filter based on the group count ("size"):

(df.with_row_count(&quot;group&quot;)
   .explode(&quot;letters&quot;)
   .sort(&quot;letters&quot;)
   .filter(pl.count().over(&quot;group&quot;) &gt; 1))

shape: (6, 2)
┌───────┬─────────┐
│ group | letters │
│ ---   | ---     │
│ u32   | str     │
╞═══════╪═════════╡
│ 0     | a       │
│ 2     | c       │
│ 2     | d       │
│ 2     | e       │
│ 0     | f       │
│ 0     | m       │
└───────┴─────────┘

.groupby().agg() can be used to recreate the lists.

.head / .limit to keep only the first N results.

(df.with_row_count(&quot;group&quot;)
   .explode(&quot;letters&quot;)
   .sort(&quot;letters&quot;)
   .filter(pl.count().over(&quot;group&quot;) &gt; 1)
   .groupby(&quot;group&quot;)
   .agg(pl.col(&quot;letters&quot;).head(2)))

shape: (2, 2)
┌───────┬────────────┐
│ group | letters    │
│ ---   | ---        │
│ u32   | list[str]  │
╞═══════╪════════════╡
│ 0     | [&quot;a&quot;, &quot;f&quot;] │
│ 2     | [&quot;c&quot;, &quot;d&quot;] │
└───────┴────────────┘

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在Python Polars数据框中对日期时间进行排序。

问题

答案1

答案2

确定数据框中各行之间的过渡和持续时间。

多线程执行多行代码在bash脚本中。

如何在两个搜索之间打印行？

Go真的比Python快那么多吗？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。