如何根据日期范围在 polars 中合并两个 DataFrame?

huangapple go评论98阅读模式
英文:

How to merge two DataFrames based on date range in polars?

问题

你有两个表格,表格1包含低频数据,表格2包含高频数据。

你想要将对应的低频数据填入高频数据表格。

要求是将表格2的event_date列中的日期与表格1的start_date和end_date之间的日期进行匹配。

如果日期在这个范围内,应将低频数据表中的相应行插入到高频数据表的相应行,确保两个表中的类型一致。

我应该如何使用Polars来实现这个操作?

英文:

You have two tables, with Table 1 containing low-frequency data and Table 2 containing high-frequency data.

You want to fill the corresponding low-frequency data into the high-frequency data table.

The requirement is to match the dates in the event_date column of Table 2 with the dates between the start_date and end_date in Table 1.

If a date falls within this range, the corresponding row from the low-frequency data should be inserted into the corresponding row of the high-frequency data, ensuring that the type in both tables aligns.

How should I use Polars to do this?

Table 1:

shape: (21_803_347, 7)
┌─────────┬────────────┬──────┬─────────┬─────────┬────────────┬────────────┐
│ type    ┆ year       ┆ qtr  ┆ rating1 ┆ rating2 ┆ start_date ┆ end_date   │
│ ---     ┆ ---        ┆ ---  ┆ ---     ┆ ---     ┆ ---        ┆ ---        │
│ str     ┆ i16        ┆ i8   ┆ f64     ┆ f64     ┆ date       ┆ date       │
╞═════════╪════════════╪══════╪═════════╪═════════╪════════════╪════════════╡
│ 000001  ┆ 1995       ┆ 4    ┆ null    ┆ 0.04    ┆ 1996-03-14 ┆ 1996-08-28 │
│ 000001  ┆ 1996       ┆ 4    ┆ null    ┆ 0.04    ┆ 1997-04-21 ┆ 1997-08-27 │
│ 000001  ┆ 1997       ┆ 2    ┆ null    ┆ 0.03    ┆ 1997-08-28 ┆ 1998-03-11 │
│ 000001  ┆ 1997       ┆ 4    ┆ null    ┆ 0.02    ┆ 1998-03-12 ┆ 1998-08-25 │
│ 000001  ┆ 1998       ┆ 2    ┆ null    ┆ 0.05    ┆ 1998-08-26 ┆ 1999-04-22 │
│ …       ┆ …          ┆ …    ┆ …       ┆ …       ┆ …          ┆ …          │
│ 600883  ┆ 1998       ┆ 2    ┆ null    ┆ null    ┆ 1998-08-14 ┆ 1999-03-22 │
│ 000792  ┆ 2000       ┆ 2    ┆ null    ┆ null    ┆ 2000-08-22 ┆ 2001-03-13 │
│ 000973  ┆ 2023       ┆ 1    ┆ null    ┆ null    ┆ 2023-04-29 ┆ 2023-08-26 │
│ 300019  ┆ 2009       ┆ 4    ┆ null    ┆ null    ┆ 2010-02-10 ┆ 2010-04-25 │
│ 600919  ┆ 2016       ┆ 4    ┆ null    ┆ null    ┆ 2017-03-20 ┆ 2017-04-26 │
└─────────┴────────────┴──────┴─────────┴─────────┴────────────┴────────────┘

Table 2:

shape: (14_061_391, 5)
┌─────────┬────────────┬─────────┬─────────┬─────────┐
│ type    ┆ event_date ┆ rating1 ┆ rating2 ┆ rating3 │
│ ---     ┆ ---        ┆ ---     ┆ ---     ┆ ---     │
│ str     ┆ date       ┆ f64     ┆ f64     ┆ f64     │
╞═════════╪════════════╪═════════╪═════════╪═════════╡
│ 000001  ┆ 1991-06-26 ┆ null    ┆ 0.008   ┆ null    │
│ 000001  ┆ 1991-06-27 ┆ null    ┆ 0.008   ┆ null    │
│ 000001  ┆ 1991-06-28 ┆ null    ┆ 0.008   ┆ null    │
│ 000001  ┆ 1991-07-01 ┆ null    ┆ 0.008   ┆ null    │
│ 000001  ┆ 1991-07-02 ┆ null    ┆ 0.008   ┆ null    │
│ …       ┆ …          ┆ …       ┆ …       ┆ …       │
│ 900957  ┆ 2023-04-24 ┆ 0.259   ┆ 0.035   ┆ 0.015   │
│ 900957  ┆ 2023-04-25 ┆ 0.234   ┆ 0.035   ┆ 0.015   │
│ 900957  ┆ 2023-04-26 ┆ 0.255   ┆ 0.035   ┆ 0.015   │
│ 900957  ┆ 2023-04-27 ┆ 0.238   ┆ 0.035   ┆ 0.015   │
│ 900957  ┆ 2023-04-28 ┆ 0.234   ┆ 0.035   ┆ 0.015   │
└─────────┴────────────┴─────────┴─────────┴─────────┘

答案1

得分: 2

执行左连接,只连接类型列,然后应用日期处理的筛选条件,为了提高效率,可以使用延迟模式:

df1.lazy().join(df2.lazy(), on="type", how="left").filter(pl.col("event_date").is_between("start_date", "end_date")).collect()
英文:

It is probably easiest to do a left join on just the type column, and then apply a filter for the date handling. To make this efficient, you can use lazy mode:

df1.lazy().join(df2.lazy(), on="type", how="left").filter(pl.col("event_date").is_between("start_date", "end_date")).collect()

huangapple
  • 本文由 发表于 2023年5月18日 00:22:25
  • 转载请务必保留本文链接:https://go.coder-hub.com/76274237.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定