英文:
How to merge two DataFrames based on date range in polars?
问题
你有两个表格,表格1包含低频数据,表格2包含高频数据。
你想要将对应的低频数据填入高频数据表格。
要求是将表格2的event_date列中的日期与表格1的start_date和end_date之间的日期进行匹配。
如果日期在这个范围内,应将低频数据表中的相应行插入到高频数据表的相应行,确保两个表中的类型一致。
我应该如何使用Polars来实现这个操作?
英文:
You have two tables, with Table 1 containing low-frequency data and Table 2 containing high-frequency data.
You want to fill the corresponding low-frequency data into the high-frequency data table.
The requirement is to match the dates in the event_date column of Table 2 with the dates between the start_date and end_date in Table 1.
If a date falls within this range, the corresponding row from the low-frequency data should be inserted into the corresponding row of the high-frequency data, ensuring that the type in both tables aligns.
How should I use Polars to do this?
Table 1:
shape: (21_803_347, 7)
┌─────────┬────────────┬──────┬─────────┬─────────┬────────────┬────────────┐
│ type ┆ year ┆ qtr ┆ rating1 ┆ rating2 ┆ start_date ┆ end_date │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i16 ┆ i8 ┆ f64 ┆ f64 ┆ date ┆ date │
╞═════════╪════════════╪══════╪═════════╪═════════╪════════════╪════════════╡
│ 000001 ┆ 1995 ┆ 4 ┆ null ┆ 0.04 ┆ 1996-03-14 ┆ 1996-08-28 │
│ 000001 ┆ 1996 ┆ 4 ┆ null ┆ 0.04 ┆ 1997-04-21 ┆ 1997-08-27 │
│ 000001 ┆ 1997 ┆ 2 ┆ null ┆ 0.03 ┆ 1997-08-28 ┆ 1998-03-11 │
│ 000001 ┆ 1997 ┆ 4 ┆ null ┆ 0.02 ┆ 1998-03-12 ┆ 1998-08-25 │
│ 000001 ┆ 1998 ┆ 2 ┆ null ┆ 0.05 ┆ 1998-08-26 ┆ 1999-04-22 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 600883 ┆ 1998 ┆ 2 ┆ null ┆ null ┆ 1998-08-14 ┆ 1999-03-22 │
│ 000792 ┆ 2000 ┆ 2 ┆ null ┆ null ┆ 2000-08-22 ┆ 2001-03-13 │
│ 000973 ┆ 2023 ┆ 1 ┆ null ┆ null ┆ 2023-04-29 ┆ 2023-08-26 │
│ 300019 ┆ 2009 ┆ 4 ┆ null ┆ null ┆ 2010-02-10 ┆ 2010-04-25 │
│ 600919 ┆ 2016 ┆ 4 ┆ null ┆ null ┆ 2017-03-20 ┆ 2017-04-26 │
└─────────┴────────────┴──────┴─────────┴─────────┴────────────┴────────────┘
Table 2:
shape: (14_061_391, 5)
┌─────────┬────────────┬─────────┬─────────┬─────────┐
│ type ┆ event_date ┆ rating1 ┆ rating2 ┆ rating3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪════════════╪═════════╪═════════╪═════════╡
│ 000001 ┆ 1991-06-26 ┆ null ┆ 0.008 ┆ null │
│ 000001 ┆ 1991-06-27 ┆ null ┆ 0.008 ┆ null │
│ 000001 ┆ 1991-06-28 ┆ null ┆ 0.008 ┆ null │
│ 000001 ┆ 1991-07-01 ┆ null ┆ 0.008 ┆ null │
│ 000001 ┆ 1991-07-02 ┆ null ┆ 0.008 ┆ null │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 900957 ┆ 2023-04-24 ┆ 0.259 ┆ 0.035 ┆ 0.015 │
│ 900957 ┆ 2023-04-25 ┆ 0.234 ┆ 0.035 ┆ 0.015 │
│ 900957 ┆ 2023-04-26 ┆ 0.255 ┆ 0.035 ┆ 0.015 │
│ 900957 ┆ 2023-04-27 ┆ 0.238 ┆ 0.035 ┆ 0.015 │
│ 900957 ┆ 2023-04-28 ┆ 0.234 ┆ 0.035 ┆ 0.015 │
└─────────┴────────────┴─────────┴─────────┴─────────┘
答案1
得分: 2
执行左连接,只连接类型列,然后应用日期处理的筛选条件,为了提高效率,可以使用延迟模式:
df1.lazy().join(df2.lazy(), on="type", how="left").filter(pl.col("event_date").is_between("start_date", "end_date")).collect()
英文:
It is probably easiest to do a left join on just the type column, and then apply a filter for the date handling. To make this efficient, you can use lazy mode:
df1.lazy().join(df2.lazy(), on="type", how="left").filter(pl.col("event_date").is_between("start_date", "end_date")).collect()
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论