在Python Polars数据框中对日期时间进行排序。

huangapple go评论115阅读模式
英文:

Sort list of datetime in Python polars dataframe

问题

I have problems working with datetime

我的问题出在处理日期时间数据上。

my dataset is like this:

我的数据集如下所示:

And I wish to sort the dates inside the list on every row, and get the first and second datetime, because some rows has more than 2 values. But I can't do it on Polars.

我希望对每一行中的日期进行排序,并获取第一个和第二个日期,因为有些行包含多于2个值。但我无法在Polars上实现这个。

I'm trying with :

我正在尝试使用:

  1. filter_df.filter(
  2. pl.col("order_purchase_timestamp").arr.sort()
  3. )

but I got this error:

但我收到了以下错误信息:

ComputeError: Filter predicate must be of type Boolean, got: SchemaMisMatch(Owned("Series of dtype: List(Datetime(Microseconds, None)) != Boolean"))

英文:

I have problems working with datetime

my dataset is like this:

在Python Polars数据框中对日期时间进行排序。

And I wish to sort the dates inside the list on every row, and get the first and second datetime, because some rows has more than 2 values. But I can't do it on Polars.

I'm trying with :

  1. filter_df.filter(
  2. pl.col("order_purchase_timestamp").arr.sort()
  3. )

but I got this error:
> ComputeError: Filter predicate must be of type Boolean, got: SchemaMisMatch(Owned("Series of dtype: List(Datetime(Microseconds, None)) != Boolean"))

答案1

得分: 3

你可以使用.with_columns()上下文来选择和修改列。

  1. col = pl.col("order_purchase_timestamp")
  2. df.with_columns(
  3. col.arr.sort(reverse=True).arr.slice(0, 2)
  4. ) # / /
  5. # 按日期排序 取片段 [0:2]

从评论讨论中得出的解决方案

要根据条件在列下执行某些操作,您可以使用pl.when -> then -> otherwise结构。pl.when()接受某个布尔Series(条件)。了解更多信息这里

  1. col = pl.col("order_purchase_timestamp")
  2. # 2个日期之间的持续时间(微秒)
  3. diff = pl.duration(microseconds=(col.arr.get(0) - col.arr.get(1)))
  4. df.filter(col.arr.lengths() >= 2).with_columns(
  5. col.arr.sort(reverse=True).arr.slice(0, 2)
  6. ).with_columns(
  7. less_30 = pl.when(diff <= pl.duration(days=30))\
  8. .then(True).otherwise(False)
  9. )
英文:

You can use .with_columns() context to select & modify columns.

  1. col = pl.col(&quot;order_purchase_timestamp&quot;)
  2. df.with_columns(
  3. col.arr.sort(reverse=True).arr.slice(0, 2)
  4. ) # / /
  5. # sort by date take slice [0:2]

Solution that follows from the discussion in the comments

To perform some action under column based on condition, you can use pl.when -&gt; then -&gt; otherwise construct. pl.when() takes some boolean Series (condition). Read more here.

  1. col = pl.col(&quot;order_purchase_timestamp&quot;)
  2. # duration (difference) between 2 dates (in microseconds)
  3. diff = pl.duration(microseconds=(col.arr.get(0) - col.arr.get(1)))
  4. df.filter(col.arr.lengths() &gt;= 2).with_columns(
  5. col.arr.sort(reverse=True).arr.slice(0, 2)
  6. ).with_columns(
  7. less_30 = pl.when(diff &lt;= pl.duration(days=30))\
  8. .then(True).otherwise(False)
  9. )

答案2

得分: 0

以下是代码部分的翻译:

  1. Instead of operating on lists - you could .explode() the column.
    你可以使用.explode()列代替操作列表。

  2. .with_row_count() can be used as a "group id":
    .with_row_count()可以用作“group id”(分组标识):

  3. You could then filter based on the group count ("size"):
    之后,你可以基于分组计数进行筛选("size"):

  4. .groupby().agg() can be used to recreate the lists.
    .groupby().agg()可用于重新创建列表。

  5. .head / .limit to keep only the first N results.
    使用.head / .limit来保留前N个结果。

英文:

Instead of operating on lists - you could .explode() the column.

  1. df = pl.DataFrame({&quot;letters&quot;: [[&quot;f&quot;, &quot;a&quot;, &quot;m&quot;], [&quot;b&quot;], [&quot;e&quot;, &quot;d&quot;, &quot;c&quot;]]})
  1. shape: (3, 1)
  2. ┌─────────────────┐
  3. letters
  4. ---
  5. list[str]
  6. ╞═════════════════╡
  7. [&quot;f&quot;, &quot;a&quot;, &quot;m&quot;]
  8. [&quot;b&quot;]
  9. [&quot;e&quot;, &quot;d&quot;, &quot;c&quot;]
  10. └─────────────────┘

.with_row_count() can be used as a "group id":

  1. df.with_row_count(&quot;group&quot;).explode(&quot;letters&quot;).sort(&quot;letters&quot;)
  1. shape: (7, 2)
  2. ┌───────┬─────────┐
  3. group | letters
  4. --- | ---
  5. u32 | str
  6. ╞═══════╪═════════╡
  7. 0 | a
  8. 1 | b
  9. 2 | c
  10. 2 | d
  11. 2 | e
  12. 0 | f
  13. 0 | m
  14. └───────┴─────────┘

You could then filter based on the group count ("size"):

  1. (df.with_row_count(&quot;group&quot;)
  2. .explode(&quot;letters&quot;)
  3. .sort(&quot;letters&quot;)
  4. .filter(pl.count().over(&quot;group&quot;) &gt; 1))
  1. shape: (6, 2)
  2. ┌───────┬─────────┐
  3. group | letters
  4. --- | ---
  5. u32 | str
  6. ╞═══════╪═════════╡
  7. 0 | a
  8. 2 | c
  9. 2 | d
  10. 2 | e
  11. 0 | f
  12. 0 | m
  13. └───────┴─────────┘

.groupby().agg() can be used to recreate the lists.

.head / .limit to keep only the first N results.

  1. (df.with_row_count(&quot;group&quot;)
  2. .explode(&quot;letters&quot;)
  3. .sort(&quot;letters&quot;)
  4. .filter(pl.count().over(&quot;group&quot;) &gt; 1)
  5. .groupby(&quot;group&quot;)
  6. .agg(pl.col(&quot;letters&quot;).head(2)))
  1. shape: (2, 2)
  2. ┌───────┬────────────┐
  3. group | letters
  4. --- | ---
  5. u32 | list[str]
  6. ╞═══════╪════════════╡
  7. 0 | [&quot;a&quot;, &quot;f&quot;]
  8. 2 | [&quot;c&quot;, &quot;d&quot;]
  9. └───────┴────────────┘

huangapple
  • 本文由 发表于 2023年3月1日 08:51:15
  • 转载请务必保留本文链接:https://go.coder-hub.com/75598678.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定