英文:
Sort list of datetime in Python polars dataframe
问题
I have problems working with datetime
我的问题出在处理日期时间数据上。
my dataset is like this:
我的数据集如下所示:
And I wish to sort the dates inside the list on every row, and get the first and second datetime, because some rows has more than 2 values. But I can't do it on Polars.
我希望对每一行中的日期进行排序,并获取第一个和第二个日期,因为有些行包含多于2个值。但我无法在Polars上实现这个。
I'm trying with :
我正在尝试使用:
filter_df.filter(
pl.col("order_purchase_timestamp").arr.sort()
)
but I got this error:
但我收到了以下错误信息:
ComputeError: Filter predicate must be of type Boolean, got: SchemaMisMatch(Owned("Series of dtype: List(Datetime(Microseconds, None)) != Boolean"))
英文:
I have problems working with datetime
my dataset is like this:
And I wish to sort the dates inside the list on every row, and get the first and second datetime, because some rows has more than 2 values. But I can't do it on Polars.
I'm trying with :
filter_df.filter(
pl.col("order_purchase_timestamp").arr.sort()
)
but I got this error:
> ComputeError: Filter predicate must be of type Boolean, got: SchemaMisMatch(Owned("Series of dtype: List(Datetime(Microseconds, None)) != Boolean"))
答案1
得分: 3
你可以使用.with_columns()
上下文来选择和修改列。
col = pl.col("order_purchase_timestamp")
df.with_columns(
col.arr.sort(reverse=True).arr.slice(0, 2)
) # / /
# 按日期排序 取片段 [0:2]
从评论讨论中得出的解决方案
要根据条件在列下执行某些操作,您可以使用pl.when -> then -> otherwise
结构。pl.when()
接受某个布尔Series(条件)。了解更多信息这里。
col = pl.col("order_purchase_timestamp")
# 2个日期之间的持续时间(微秒)
diff = pl.duration(microseconds=(col.arr.get(0) - col.arr.get(1)))
df.filter(col.arr.lengths() >= 2).with_columns(
col.arr.sort(reverse=True).arr.slice(0, 2)
).with_columns(
less_30 = pl.when(diff <= pl.duration(days=30))\
.then(True).otherwise(False)
)
英文:
You can use .with_columns()
context to select & modify columns.
col = pl.col("order_purchase_timestamp")
df.with_columns(
col.arr.sort(reverse=True).arr.slice(0, 2)
) # / /
# sort by date take slice [0:2]
Solution that follows from the discussion in the comments
To perform some action under column based on condition, you can use pl.when -> then -> otherwise
construct. pl.when()
takes some boolean Series (condition). Read more here.
col = pl.col("order_purchase_timestamp")
# duration (difference) between 2 dates (in microseconds)
diff = pl.duration(microseconds=(col.arr.get(0) - col.arr.get(1)))
df.filter(col.arr.lengths() >= 2).with_columns(
col.arr.sort(reverse=True).arr.slice(0, 2)
).with_columns(
less_30 = pl.when(diff <= pl.duration(days=30))\
.then(True).otherwise(False)
)
答案2
得分: 0
以下是代码部分的翻译:
-
Instead of operating on lists - you could
.explode()
the column.
你可以使用.explode()
列代替操作列表。 -
.with_row_count()
can be used as a "group id":
.with_row_count()
可以用作“group id”(分组标识): -
You could then filter based on the group count ("size"):
之后,你可以基于分组计数进行筛选("size"): -
.groupby().agg()
can be used to recreate the lists.
.groupby().agg()
可用于重新创建列表。 -
.head
/.limit
to keep only the firstN
results.
使用.head
/.limit
来保留前N
个结果。
英文:
Instead of operating on lists - you could .explode()
the column.
df = pl.DataFrame({"letters": [["f", "a", "m"], ["b"], ["e", "d", "c"]]})
shape: (3, 1)
┌─────────────────┐
│ letters │
│ --- │
│ list[str] │
╞═════════════════╡
│ ["f", "a", "m"] │
│ ["b"] │
│ ["e", "d", "c"] │
└─────────────────┘
.with_row_count()
can be used as a "group id":
df.with_row_count("group").explode("letters").sort("letters")
shape: (7, 2)
┌───────┬─────────┐
│ group | letters │
│ --- | --- │
│ u32 | str │
╞═══════╪═════════╡
│ 0 | a │
│ 1 | b │
│ 2 | c │
│ 2 | d │
│ 2 | e │
│ 0 | f │
│ 0 | m │
└───────┴─────────┘
You could then filter based on the group count ("size"):
(df.with_row_count("group")
.explode("letters")
.sort("letters")
.filter(pl.count().over("group") > 1))
shape: (6, 2)
┌───────┬─────────┐
│ group | letters │
│ --- | --- │
│ u32 | str │
╞═══════╪═════════╡
│ 0 | a │
│ 2 | c │
│ 2 | d │
│ 2 | e │
│ 0 | f │
│ 0 | m │
└───────┴─────────┘
.groupby().agg()
can be used to recreate the lists.
.head
/ .limit
to keep only the first N
results.
(df.with_row_count("group")
.explode("letters")
.sort("letters")
.filter(pl.count().over("group") > 1)
.groupby("group")
.agg(pl.col("letters").head(2)))
shape: (2, 2)
┌───────┬────────────┐
│ group | letters │
│ --- | --- │
│ u32 | list[str] │
╞═══════╪════════════╡
│ 0 | ["a", "f"] │
│ 2 | ["c", "d"] │
└───────┴────────────┘
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论