Polars - How to find all the records in a dataframe which have at least one in-kind occurrence between two given datetimes?
You want to return, for each vendor, all product transactions that had at least one transaction within the store's operating hours. Here is one way you could approach this with Python Pandas:
import pandas as pd

# Create the store hours dataframe
store_hours_data = {
    'vendorId': [115, 1146, 18143, 19122, 7744, 9092, 5801, 2378],
    'opening time': ['2020-02-25 12:00:01', '2020-02-16 12:00:11', '2020-02-25 12:00:16', '2020-02-16 12:00:19', '2020-02-16 12:00:11', '2020-02-16 12:00:31', '2020-11-01 12:00:41', '2020-02-16 12:00:11'],
    'closing time': ['2022-02-25 02:00:02', '2022-02-26 02:00:48', '2022-02-25 02:00:46', '2022-02-16 02:00:42', '2022-02-16 02:00:57', '2022-02-24 02:00:57', '2022-02-19 02:00:36', '2022-02-16 02:00:47']
}
store_hours_df = pd.DataFrame(store_hours_data)

# Create the transactions dataframe
transactions_data = {
    'vendor': [7744, 7744, 7744, 2378, 2378, 115],
    'prod': [12, 13, 11, 20, 20, 3],
    'saletime': ['2020-02-25 12:10:01', '2020-02-25 12:15:01', '2023-02-25 12:17:01', '2020-02-16 12:18:01', '2022-02-25 12:20:01', '2022-02-26 12:10:01'],
    'value': [10.0, 12.0, 10.0, 10.0, 10.0, 12.0],
    'count': [6, 7, 4, 2, 2, 5],
    'quality': [13.7, 21.9, 13.0, 14.0, 14.0, 19.0]
}
transactions_df = pd.DataFrame(transactions_data)

# Convert the time columns to datetime
store_hours_df['opening time'] = pd.to_datetime(store_hours_df['opening time'])
store_hours_df['closing time'] = pd.to_datetime(store_hours_df['closing time'])
transactions_df['saletime'] = pd.to_datetime(transactions_df['saletime'])

# Join the transactions to the store hours and keep the rows that fall inside opening hours
merged_df = pd.merge(transactions_df, store_hours_df, how='inner', left_on='vendor', right_on='vendorId')
filtered_df = merged_df[(merged_df['saletime'] >= merged_df['opening time']) & (merged_df['saletime'] <= merged_df['closing time'])]

# Drop the extra columns (reassign instead of an inplace drop on a filtered slice to avoid SettingWithCopyWarning)
filtered_df = filtered_df.drop(columns=['vendorId', 'opening time', 'closing time'])

# Print the result
print(filtered_df)
This code filters the joined transactions down to those that fall within each store's operating hours. You can process or export the result further as needed. Hope this helps!
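If you instead want the stricter rule described in the question below (keep every transaction of a vendor/product pair as soon as at least one of its transactions falls inside the hours), you need a group-wise check on top of this row-wise filter. A minimal Pandas sketch of that idea, reusing the merged_df built above (the in_hours and keep names are purely illustrative):

in_hours = (merged_df['saletime'] >= merged_df['opening time']) & (merged_df['saletime'] <= merged_df['closing time'])
# Broadcast "does this vendor/prod group contain any in-hours sale?" back onto every row of the group
keep = merged_df.assign(in_hours=in_hours).groupby(['vendor', 'prod'])['in_hours'].transform(lambda s: s.any())
result = merged_df[keep].drop(columns=['vendorId', 'opening time', 'closing time'])
print(result)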
Question
I have two dataframes: one for store operation dates and one for transactions.
The store hours dataframe is like so
┌───────────┬─────────────────────┬─────────────────────┐
│ vendorId ┆ opening time ┆ closing time │
│ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ datetime[μs] │
╞═══════════╪═════════════════════╪═════════════════════╡
│ 115 ┆ 2020-02-25 12:00:01 ┆ 2022-02-25 02:00:02 │
│ 1146 ┆ 2020-02-16 12:00:11 ┆ 2022-02-26 02:00:48 │
│ 18143 ┆ 2020-02-25 12:00:16 ┆ 2022-02-25 02:00:46 │
│ 19122 ┆ 2020-02-16 12:00:19 ┆ 2022-02-16 02:00:42 │
│ … ┆ … ┆ … │
│ 7744 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 09092 ┆ 2020-02-16 12:00:31 ┆ 2022-02-24 02:00:57 │
│ 5801 ┆ 2020-11-01 12:00:41 ┆ 2022-02-19 02:00:36 │
│ 2378 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
└───────────┴─────────────────────┴─────────────────────┘
The transactions df looks something like this:
┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod ┆ saletime ┆ value ┆ count ┆ quality │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆datetime ┆ f64 ┆ i64 ┆ f64 │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 7744 ┆ 12 ┆ 2020-02-25 12:10:01 ┆ 10.0 ┆ 6 ┆ 13.7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744 ┆ 13 ┆ 2020-02-25 12:15:01 ┆ 12.0 ┆ 7 ┆ 21.9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744 ┆ 11 ┆ 2023-02-25 12:17:01 ┆ 10.0 ┆ 4 ┆ 13.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378 ┆ 20 ┆ 2020-02-16 12:18:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378 ┆ 20 ┆ 2022-02-25 12:20:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 115 ┆ 3 ┆ 2022-02-26 12:10:01 ┆ 12.0 ┆ 5 ┆ 19.0 │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘
My goal is, for each vendor, to return all the product transactions which had at least one transaction within the store's hours. For example, for vendor 7744 it would be the following.
┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod ┆ saletime ┆ value ┆ count ┆ quality │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆datetime ┆ f64 ┆ i64 ┆ f64 │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 7744 ┆ 12 ┆ 2020-02-25 12:10:01 ┆ 10.0 ┆ 6 ┆ 13.7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744 ┆ 13 ┆ 2020-02-25 12:15:01 ┆ 12.0 ┆ 7 ┆ 21.9 │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘
These products were sold between the store's opening and closing times, while product 11 was sold after the closing date.
On the other hand for vendor 2378 it should return:
┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod ┆ saletime ┆ value ┆ count ┆ quality │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆datetime ┆ f64 ┆ i64 ┆ f64 │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 2378 ┆ 20 ┆ 2020-02-16 12:18:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378 ┆ 20 ┆ 2022-02-25 12:20:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘
Even though the second transaction occurred after the closing date, it is considered an ongoing transaction because part of the transaction occurred during the store's operation dates.
The expected final df is
┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod ┆ saletime ┆ value ┆ count ┆ quality │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆datetime ┆ f64 ┆ i64 ┆ f64 │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 7744 ┆ 12 ┆ 2020-02-25 12:10:01 ┆ 10.0 ┆ 6 ┆ 13.7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744 ┆ 13 ┆ 2020-02-25 12:15:01 ┆ 12.0 ┆ 7 ┆ 21.9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378 ┆ 20 ┆ 2020-02-16 12:18:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378 ┆ 20 ┆ 2022-02-25 12:20:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘
Is there an efficient way to accomplish this? Any help is appreciated.
Answer 1
Score: 1
You can do a left join to put every vendor's opening/closing hours in the transactions df, then filter accordingly:
df_transactions.join(df_hours, left_on="vendor", right_on="vendorId").filter(
pl.col("saletime").is_between(pl.col("opening time"), pl.col("closing time"))
)
shape: (4, 8)
┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┬─────────────────────┬─────────────────────┐
│ vendor ┆ prod ┆ saletime ┆ value ┆ count ┆ quality ┆ opening time ┆ closing time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ datetime[μs] ┆ f64 ┆ i64 ┆ f64 ┆ datetime[μs] ┆ datetime[μs] │
╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╪═════════════════════╪═════════════════════╡
│ 7744 ┆ 12 ┆ 2020-02-25 12:10:01 ┆ 10.0 ┆ 6 ┆ 13.7 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 7744 ┆ 13 ┆ 2020-02-25 12:15:01 ┆ 12.0 ┆ 7 ┆ 21.9 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 2378 ┆ 20 ┆ 2020-02-16 12:18:01 ┆ 10.0 ┆ 2 ┆ 14.0 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
│ 2378 ┆ 20 ┆ 2020-02-25 12:20:01 ┆ 10.0 ┆ 2 ┆ 14.0 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
└────────┴──────┴─────────────────────┴───────┴───────┴─────────┴─────────────────────┴─────────────────────┘
(You can select out the opening/closing time columns from here if desired.)
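For example, a minimal sketch of that clean-up step, selecting back only the original transaction columns (nothing here beyond the join and filter already shown):

(
    df_transactions
    .join(df_hours, left_on="vendor", right_on="vendorId")
    .filter(
        pl.col("saletime").is_between(pl.col("opening time"), pl.col("closing time"))
    )
    # keep only the columns that were in the transactions df to begin with
    .select(df_transactions.columns)
)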
Answer 2
Score: 0
In your question you say:
> Even though the second transaction occurred after the closing date, it is considered an ongoing transaction because part of the transaction occurred during the store's operation dates.
so I think you meant for your transactions to look something more like this:
txns=pl.from_repr(
"""shape: (6, 6)
┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┐
│ vendor ┆ prod ┆ saletime ┆ value ┆ count ┆ quality │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ datetime[μs] ┆ f64 ┆ i64 ┆ f64 │
╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╡
│ 7744 ┆ 12 ┆ 2020-02-25 12:10:01 ┆ 10.0 ┆ 6 ┆ 13.7 │
│ 7744 ┆ 13 ┆ 2020-02-25 12:15:01 ┆ 12.0 ┆ 7 ┆ 21.9 │
│ 7744 ┆ 11 ┆ 2023-02-25 12:17:01 ┆ 10.0 ┆ 4 ┆ 13.0 │
│ 2378 ┆ 20 ┆ 2020-02-16 12:18:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
│ 2378 ┆ 20 ┆ 2022-02-25 12:20:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
│ 115 ┆ 3 ┆ 2022-02-26 12:10:01 ┆ 12.0 ┆ 5 ┆ 19.0 │
└────────┴──────┴─────────────────────┴───────┴───────┴─────────┘""")
where the second 2378 row has a transaction time after the closing time from the hours df.
hours=pl.from_repr(
"""┌───────────┬─────────────────────┬─────────────────────┐
│ vendorId ┆ opening time ┆ closing time │
│ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ datetime[μs] │
╞═══════════╪═════════════════════╪═════════════════════╡
│ 115 ┆ 2020-02-25 12:00:01 ┆ 2022-02-25 02:00:02 │
│ 1146 ┆ 2020-02-16 12:00:11 ┆ 2022-02-26 02:00:48 │
│ 18143 ┆ 2020-02-25 12:00:16 ┆ 2022-02-25 02:00:46 │
│ 19122 ┆ 2020-02-16 12:00:19 ┆ 2022-02-16 02:00:42 │
│ … ┆ … ┆ … │
│ 7744 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 09092 ┆ 2020-02-16 12:00:31 ┆ 2022-02-24 02:00:57 │
│ 5801 ┆ 2020-11-01 12:00:41 ┆ 2022-02-19 02:00:36 │
│ 2378 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
└───────────┴─────────────────────┴─────────────────────┘""")
With that, if you want to return all the transactions for a vendor/product pair when at least one transaction for that pair is within the open/close time, even if the particular row isn't, then you would use a window function (i.e. .over) in your filter.
(
txns
.join(hours, left_on="vendor", right_on="vendorId")
.filter(
(
pl.col('saletime')
.is_between(pl.col("opening time"), pl.col("closing time"))
.any()
).over(['vendor','prod']))
.select(txns.columns)
)
shape: (4, 6)
┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┐
│ vendor ┆ prod ┆ saletime ┆ value ┆ count ┆ quality │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ datetime[μs] ┆ f64 ┆ i64 ┆ f64 │
╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╡
│ 7744 ┆ 12 ┆ 2020-02-25 12:10:01 ┆ 10.0 ┆ 6 ┆ 13.7 │
│ 7744 ┆ 13 ┆ 2020-02-25 12:15:01 ┆ 12.0 ┆ 7 ┆ 21.9 │
│ 2378 ┆ 20 ┆ 2020-02-16 12:18:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
│ 2378 ┆ 20 ┆ 2022-02-25 12:20:01 ┆ 10.0 ┆ 2 ┆ 14.0 │
└────────┴──────┴─────────────────────┴───────┴───────┴─────────┘
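The same expression works unchanged on LazyFrames if the real data is large; a sketch, assuming txns and hours are the frames defined above:

(
    txns.lazy()
    .join(hours.lazy(), left_on="vendor", right_on="vendorId")
    .filter(
        pl.col("saletime")
        .is_between(pl.col("opening time"), pl.col("closing time"))
        .any()
        .over(["vendor", "prod"])
    )
    # keep only the original transaction columns, then materialize
    .select(txns.columns)
    .collect()
)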
Answer 3
Score: 0
There is also .join_asof (https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.join_asof.html), which is an optimized join for this type of task.
e.g. https://stackoverflow.com/questions/73176563/python-polars-join-on-column-with-greater-or-equal/
df_transactions.join_asof(
df_hours,
left_on="saletime",
right_on="opening time",
by_left="vendor",
by_right="vendorId",
).filter(
pl.col("saletime")
.is_between(pl.col("opening time"), pl.col("closing time"))
.any()
.over("vendor", "prod")
)
shape: (4, 8)
┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┬─────────────────────┬─────────────────────┐
│ vendor ┆ prod ┆ saletime ┆ value ┆ count ┆ quality ┆ opening time ┆ closing time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ datetime[μs] ┆ f64 ┆ i64 ┆ f64 ┆ datetime[μs] ┆ datetime[μs] │
╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╪═════════════════════╪═════════════════════╡
│ 7744 ┆ 12 ┆ 2020-02-25 12:10:01 ┆ 10.0 ┆ 6 ┆ 13.7 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 7744 ┆ 13 ┆ 2020-02-25 12:15:01 ┆ 12.0 ┆ 7 ┆ 21.9 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 2378 ┆ 20 ┆ 2020-02-16 12:18:01 ┆ 10.0 ┆ 2 ┆ 14.0 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
│ 2378 ┆ 20 ┆ 2022-02-25 12:20:01 ┆ 10.0 ┆ 2 ┆ 14.0 ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
└────────┴──────┴─────────────────────┴───────┴───────┴─────────┴─────────────────────┴─────────────────────┘